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1  Introduction 

The  overall  goal  of  our  AFOSR  sponsored  research  program  is  to  develop  a  general  and 
systematic  foundation  and  methodologies  for  robust  rapid  change-point  detection  in  the 
context  of  fusing  noisy  data  from  heterogeneous  networked  sensors,  and  apply  them  to  model 
behavior  data  in  experiments  with  nncertain  onset  time  of  stimulus.  Our  results  offer  a 
deeper  understanding  how  hnmans  process  information  in  dynamic  environments  nnder  time 
pressure,  and  provide  new  mathematical  tools  to  model  behavior  data. 

Rapid  change-point  detection,  or  seqnential  change-point  detection,  has  a  variety  of  appli¬ 
cations  snch  as  industrial  quality  control,  signal  detection  and  biosurveillance.  The  classical 
version  of  this  problem,  where  one  monitors  a  single  univariate  independent  and  identically 
distributed  (i.i.d.)  data  stream,  is  a  well-developed  area,  and  many  classical  schemes  have 
been  developed  such  as  the  Shewhart’s  chart,  moving  average  control  charts.  Page’s  GUSUM 
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procedure,  Shiryaev-Roberts  procedure  and  scan  statistics.  In  nowadays  information  tech¬ 
nology  infrastructure  age,  rapid  change-point  detection  hnds  challenging  new  applications 
due  to  its  ubiquitous  nature.  However,  we  need  to  extend  the  theories  and  methodologies 
beyond  the  classical  i.i.d.  models. 

Our  AFOSR  sponsored  research  program  focuses  on  two  specihc  applications.  The  hrst 
is  the  multi-sensor  data  fusion  problem,  which  has  many  important  applications,  including 
cognition,  psychology,  sensor  network  systems,  surveillance  systems,  and  economic  theory  of 
teams.  Practically,  a  multi-sensor  system  can  be  a  collective  of  intelligent  sensors,  robots  or 
agents  acting  together  to  solve  a  common  task,  and  its  motivation  is  to  mimic  the  ongoing 
cognitive  process  used  by  humans  that  integrates  data  continually  from  their  senses  to  make 
inferences  about  the  external  world.  The  second  involves  in  psychology  and  neuroscience, 
where  researchers  model  the  human  (and  animal)  decision  making  process  as  a  form  of 
sequential  sampling  process.  We  illustrate  that  rapid  change-point  detection  methodologies 
provide  new  mathematical  tools  for  modelling. 

This  report  will  furnish  the  successes  and  achievements  accomplished  under  AFOSR 
funding. 

2  Overview  of  Accomplishments 

2.1  Efficient  Scalable  Global  Detection  Schemes 

In  the  modern  information  age,  one  often  faces  the  need  to  monitor  a  large  number  of  noisy 
data  streams  with  the  aim  of  offering  the  potential  for  early  detection  of  a  “trigger”  event. 
In  the  literature,  two  special  categories  of  global  monitoring  have  gained  a  lot  of  attention: 
one  is  consensus  detection  in  which  the  occurring  event  is  assumed  to  affect  all  data  streams 
abruptly  and  simultaneously,  and  the  other  is  the  case  when  one  and  only  one  data  stream  is 
known  to  be  affected  by  the  occurring  event,  e.g.,  multi-channel  detection  in  signal  detection. 

In  our  research,  we  consider  a  general  scenario  when  the  occurring  event  may  affect 
different  data  streams  differently.  For  instance,  we  may  know  that  the  event  can  affect  10 
out  of  100  data  streams  but  we  do  not  know  which  10  data  streams  will  be  affected.  In 
addition,  the  event  could  have  an  immediate  or  delayed  impact  on  affected  data  streams. 
One  example  is  biosurveillance  when  a  disease  outbreak  may  hrst  occur  in  some  unknown 
local  cities  or  counties  and  then  spread  out  to  other  regions.  Another  example  occurs  in 
target  detection  when  one  uses  several  different  sensor  types  such  as  radar,  sonar,  infrared 
and  magnetic  to  detect  a  target,  and  it  is  possible  that  some  (but  not  all)  sensors  can  provide 
earlier  information  about  the  target. 

Ideally,  one  would  like  to  develop  a  single  global  monitoring  scheme  that  can  detect  the 
occurring  event  as  quickly  as  possible  while  controlling  the  system-wise  global  false  alarm 
rate.  While  such  global  monitoring  schemes  can  be  found  by  the  standard  statistical  meth¬ 
ods  such  as  generalized  likelihood  ratios  or  mixture  likelihood  ratios  from  the  theoretical 
viewpoint,  they  are  generally  infeasible  from  the  computational  viewpoint  since  they  would 
require  us  to  do  an  exhaustive  search  over  all  possible  post-change  hypotheses  or  over  all 
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possible  combinations  of  affected  data  streams.  Hence,  it  is  desirable  to  find  a  global  mon¬ 
itoring  scheme  that  is  not  only  efficient,  but  also  scalable  in  the  sense  of  being  able  to  be 
implemented  to  monitor  a  large  number  of  data  streams  over  a  long  period  of  time. 

To  develop  a  scalable  global  scheme,  one  intuitively  appealing  approach  is  to  monitor  each 
local  data  stream  locally  through  some  classical  computationally  efficient  schemes  and  then 
combine  all  local  schemes  together  to  produce  a  single  global  scheme.  Unfortunately,  little 
research  has  been  done  on  how  to  combine  all  local  schemes  together  to  ensure  efficiency, 
and  to  the  best  of  our  knowledge,  only  a  naive  approach  has  been  proposed  so  far.  The  naive 
approach  is  to  raise  an  alarm  at  the  global  level  whenever  any  local  scheme  raises  a  local 
alarm.  This  naive  approach  is  very  effective  if  one  or  very  few  data  streams  are  affected, 
but  it  may  lose  efficiency  when  several  or  more  data  streams  can  provide  information  to  the 
occurring  event.  For  the  purpose  of  comparison,  below  the  global  scheme  corresponding  to 
this  naive  approach  will  be  referred  as  a  “MAX”  scheme,  since  it  can  be  thought  of  raising  a 
global  alarm  if  the  maximum  of  local  monitoring  statistics  is  too  large,  where,  if  necessary, 
one  can  normalize  the  local  schemes  so  that  the  local  detection  thresholds  are  the  same. 

Our  research  offers  new  approaches  to  combine  all  local  schemes  together  to  produce 
an  efficient  scalable  global  scheme  for  rapid  change-point  detection.  In  Mei  (2010),  we 
propose  to  raise  an  alarm  at  the  global  level  if  the  sum  of  local  monitoring  statistics  (e.g., 
local  CUSUM  statistics  in  the  logarithm  scale)  is  too  large,  and  theoretical  analysis  and 
numerical  simulations  show  that  the  corresponding  “SUM”  scheme  is  efficient  when  the 
number  of  affected  data  streams  is  very  large. 

In  Mei  (2011),  we  take  a  further  step  to  monitor  a  large  number  of  data  streams  via 
thresholding.  The  fundamental  idea  is  to  raise  a  global  alarm  based  on  the  sum  of  those 
local  detection  statistics  (e.g.,  local  CUSUM  statistics)  that  are  “large”  under  either  top-r 
thresholding  rules  or  hard  thresholding  or  both.  It  turns  out  that  the  corresponding  global 
schemes  include  both  MAX  and  SUM  schemes  as  special  cases  and  possess  certain  asymptotic 
optimality  properties  under  proper  criteria.  It  is  worth  pointing  out  that  the  thresholding 
idea  has  been  applied  for  high-dimensional  data  in  the  off-line  statistical  inference  literature 
to  improve  power  or  efficiency,  but  our  application  to  rapid  change-point  detection  is  new. 

Mathematically,  assume  that  in  a  system,  one  is  monitoring  K  data  streams  over  time  n, 
say,  for  /c  =  1, . . . ,  iF.  Initially,  the  system  is  “in  control”  and  the  distribution  of 

the  Xk^nS  is  fk  for  the  k-th  data  stream.  At  some  unknown  time  z/,  a  “trigger”  event  occurs 
and  affects  the  system  in  the  sense  that  the  density  function  of  the  observations  Xk^nS 
changes  from  fk  to  another  given  density  Qk  at  time  >  z/.  Without  loss  of  generality,  we 
assume  that  mini<fc<ii-  z/^  =  z/,  since  otherwise  the  system  is  affected  by  the  event  only  after 
the  time  z/'  =  mini<fc<x  z/^,  which  can  be  treated  as  the  new  change-point.  It  is  desired  to 
utilize  the  observed  data  streams  Xk^nS  to  raise  an  alarm  at  the  global  level  as  soon  as  the 
event  occurs  so  that  one  can  take  appropriate  action. 

A  standard  minimax  formulation  is  to  minimize  the  detection  delay  subject  to  a  false 
alarm  constraint.  The  latter  is  typical  of  the  form 

E<“>(r)  >  7,  (1) 
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where  T  is  the  stopping  rule  associated  with  the  detection  scheme,  7  >  0  is  a  pre-specihed 
constant,  and  denote  the  expectation  when  there  are  no  changes.  In  the  literature 

E{°°)(T)  is  often  called  as  the  average  run  length  to  false  alarm.  The  definition  of  detection 
delay  is  a  little  more  complicated,  as  it  needs  to  take  into  account  the  uncertainty  of  the 
change-point  v.  One  widely  used  dehnition  is  the  following  “worst  case”  detection  delay 


D(r)  = 


sup  ess  sup  ( (T 


v  +  iy 


(2) 


where  =  max(a;,0),  =  (Xi  [i  ,^_i],. . . , Xx,[i,i/-i])  denotes  past  global  information  at 

time  z/,  and  [1  ,^_i]  =  (X^.  1, . . . ,  ,^_i)  is  past  local  information  for  the  A;-th  data  stream. 

Before  discussing  the  existing  schemes  for  globally  monitoring,  let  us  consider  the  local 
schemes  for  monitoring  a  single  data  stream,  say,  the  /cth  data  stream.  Such  a  problem  has 
been  well  studied  in  the  sequential  change-point  detection  literature,  and  one  efficient  local 
scheme  is  Page’s  CUSUM  procedure  which  raise  a  local  alarm  as  soon  as  the  local  CUSUM 
statistic  exceeds  some  pre-specihed  threshold,  where  the  local  CUSUM  statistic  for  the  kth. 
data  stream  at  time  n  is  given  by 

Wk^n  =  max  1 0,  max  ^  log  | . 

In  practice,  the  local  CUSUM  statistic  114, n  can  be  computed  conveniently  online  via  a 
recursive  formula: 

lUfc.n  =  max  (Wk,n-i  +  log 

^  Jk[^k,n)  ^ 


and  lUfc^o  =  0. 

Now  let  us  go  back  our  global  monitoring  problem.  As  mentioned  before,  the  naive 
approach  is  the  “MAX”  scheme  that  raises  an  alarm  at  the  global  level  as  soon  as  one  of 
local  CUSUM  procedures  raises  a  local  alarm.  Specihcally,  the  “MAX”  scheme  raises  a 
global  alarm  at  time 


Umax(c)  =  hrst  time  n  >  1  such  that  max  II4n  >  c,  (4) 

l<fc<iC  ’ 

(=  cxD  if  such  n  does  not  exist)  where  c  >  0  is  a  pre-specihed  constant  chosen  to  satisfy  the 
false  alarm  constraint  (1). 

In  Mei  (2010),  we  propose  the  “SUM”  scheme  that  raises  an  alarm  when  the  sum  of  local 
CUSUM  statistics  is  too  large.  Specihcally,  at  time  n,  each  data  streams  calculates  its  local 
CUSUM  statistic  II4,n’s  as  in  (3),  and  then  one  will  raise  an  alarm  at  the  global  level  at 
time 

K 

Tsum{d)  =  hrst  time  n  >  1  such  that  IUfc,n  >  d,  (5) 

k=l 
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(=  oo  if  such  n  does  not  exist)  where  the  constant  d  >  0  is  some  suitably  chosen  constant. 
The  intuition  behind  the  “SUM”  rule  is  that  a  large  value  of  local  CUSUM  statistic  indicates 
a  possible  local  change,  and  thus  the  sum  of  local  CUSUM  statistics  is  a  reasonable  detection 
statistic  if  the  number  of  affected  data  streams  is  large  or  completely  unknown.  Intuitively, 
the  “MAX”  scheme  Tma^fc)  in  (4)  works  better  when  one  or  very  few  data  streams  are 
affected,  whereas  the  “SUM”  scheme  Ts-a^{d)  in  (5)  works  better  when  many  data  streams 
are  affected,  and  numerical  simulations  in  Mei  (2010)  indeed  verihed  this  intuition. 

Recently,  in  Mei  (2011),  motivated  by  the  scenario  when  one  has  a  prior  knowledge  that 
at  most  r  data  streams  will  be  affected  and  by  the  applications  in  censoring  sensor  networks, 
we  propose  the  following  new  global  schemes: 

•  The  top-r  thresholding  scheme.  At  each  time  n,  we  order  the  K  local  CUSUM 
statistics  fUgn,  •  •  • ,  from  largest  to  smallest:  IU(i),n  >  ^(2),n  >  •  •  •  >  W(^K),n- 

Then  the  top-r  thresholding  scheme  raises  a  global  alarm  at  time 

r 

Ntop,r{cb)  =  hrst  time  n  >  1  such  that  W{k)^n  >  o.  (6) 

k=l 


•  The  hard  thresholding  scheme.  At  time  n,  each  local  data  stream  calculates  its 
local  CUSUM  statistic  144, n  recursively  as  in  (3).  Then  at  time  n,  the  decision-maker 
will  take  a  “keep  or  kill”  policy  to  make  a  decision  based  on 


f  Wk,n,  if  fUfc,n  >  h 
\  NULL,  if  Wk,n  <  bk  ’ 


where  bk  >  0  is  the  local  censoring  (hard  threshold)  parameter  at  the  k-th  data  stream. 
To  be  more  concrete,  our  proposed  hard  thresholding  scheme  raises  a  global  alarm  at 
time 


K 

Nhardia,  b)  =  hrst  n  >  1  such  that  ^  Wk,nI{Wk,n  >  bk}  >  a.  (7) 

k=l 

A  “good”  choice  of  the  bk’s  is  bk  =  Pkb  for  k  =  1, . . . ,  K  for  some  constant  6  >  0  where 

_  ^{dki  fk) 

''‘“EEUteda 

can  be  thought  of  as  the  weight  of  the  /c-th  data  stream  in  the  overall  hnal  decision, 
and  I{gk,fk)  is  the  Kullback-Leibler  information  number  dehned  by 


^{.9k)  fk) 


log 


9k{x) 

fk{x) 


gk{x)dfi{x). 
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•  The  Combined  Thresholding  Scheme.  At  time  n,  each  local  data  stream  uses  the 
hard  threshold  method  to  generate  Uk,n  =  hldc,n-^{hl4,n  >  Pkb}i  and  the  decision  maker 
uses  the  top-r  thresholding  rule  to  combine  Uk,nS  to  make  a  decision.  Mathematically, 
the  combined  thresholding  scheme  raises  an  alarm  at  the  global  level  at  time 

r 

Ncomb,r{cbi  b)  =  hrst  n  >  1  such  that  E  U{k),n  ^  (8) 

k=l 


It  is  worth  mentioning  that  our  research  has  led  to  some  interesting  spin-off  results  in 
renewal  theory  and  applied  probability.  Asymptotic  analysis  of  the  properties  of  our  proposed 
new  schemes  involves  the  development  of  some  novel  methods  to  investigate  the  limiting 
behavior  of  W*  =  maxo<i<n  'n,k=i  ^k,i  under  different  scenarios,  where  Wk,i  =  max(lTfc,j_i  -|- 
with  Wkfl  =  0,  and  the  fk/^  are  K  independent  (not  necessarily  identical)  sequences 
of  random  variables.  The  CUSUM-like  statistic  Wk,i  and  its  extreme  value  process  Wf  play 
an  important  role  in  queue  theory,  where  Wk,i  is  the  waiting  time  of  the  Ath  customer. 
In  the  literature  the  properties  of  Wf  with  A'  =  1  has  been  investigated,  but  that  of  Wf 
with  K  >  2  has  not  been  studied  so  far.  During  the  process,  we  hnd  an  elegant  result 
on  the  first  ladder  epoch  r  where  all  K  independent  one-dimensional  random  walks  reach 
recording-setting  values  simultaneously.  Specifically,  in  Mei  (2010),  we  show  that  under  a 
general  condition,  E  (r)  =  nf=iE(r,),  where  is  the  first  weak  ladder  epoch  of  the  k-ih. 
one-dimensional  random  walk.  This  simple  and  interesting  result  is  a  new  contribution  to 
the  well-developed  renewal  theory. 

2.2  Non- homogeneous  Poisson 

Another  topic  of  our  research  is  concerned  with  the  scenario  when  the  observations  are  non- 
homogeneous  under  the  baseline.  Motivated  by  applications  in  bio  and  syndromic  surveil¬ 
lance,  we  investigate  the  problem  of  detecting  a  change  in  the  mean  of  Poisson  distributions 
after  taking  into  account  the  effects  of  population  size.  The  specific  data  motivating  our  re¬ 
search  concerns  male  thyroid  cancer  cases  (with  malignant  behavior)  in  New  Mexico  during 
1973-2005.  The  data  set  has  been  studied  before  in  the  biosurveillance  literature  and  is  avail¬ 
able  from  the  Surveillance,  Epidemiology,  and  End  Results  (SEER)  Program  at  the  National 
Cancer  Institute  that  collects  information  on  cancer  incidence,  mortality,  and  survival  from 
the  population-based  cancer  registries  in  the  United  States.  Figure  1  plots  three  different 
curves  related  to  this  data  set:  (1)  yearly  total  number  of  cancers  with  malignant  behavior; 
(2)  yearly  population  size  (of  males)  in  New  Mexico;  and  (3)  yearly  (crude)  incidence  rate 
per  100,000  (male)  population. 

An  interesting  goal  in  epidemiology  and  (bio)surveillance  is  to  determine  whether  or  not 
the  risk  for  male  thyroid  cancer  increases  over  time.  The  term  risk  here  essentially  means 
the  probability  of  developing  thyroid  cancer  in  a  given  year,  which  can  be  characterized  by 
the  incidence  rate  per  100,000  (male)  population;  see  the  plot  in  the  bottom  panel  of  Figure 
1.  Since  the  binomial  distribution  can  be  approximated  by  Poisson  distribution  with  the 
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Figure  1:  Three  time  series  data  of  male  thyroid  cancer  in  New  Mexico  during  1973-2005. 
Top:  the  left  panel  plots  the  total  number  of  male  thyroid  cancers  over  years,  and  the  right 
panel  illustrates  the  trend  of  the  male  population.  Bottom:  the  plot  is  of  the  crude  cancer 
incidence  per  100,000  population  over  years. 


same,  we  can  simply  assume  that  we  observe  two-dimensional  random  vectors  (/„,  Yn)  over 
time  n,  where  Yn  has  a  Poisson  distribution  with  mean  /r„  =  In^n-  Here  In,  Yn  and  A„  can  be 
thought  of  as  the  population  size  (in  the  units  of  100,000  population),  the  number  of  disease 
cases,  and  the  (unobservable  true)  incidence  rate  per  100,000  (male)  population  at  the  n-th 
year,  respectively. 

Under  a  very  simplified  setting,  the  A^’s,  e.g.,  the  incidence  rates  per  100,000  (male) 
population,  are  assumed  to  change  from  one  value  Aq  to  another  value  Ai  at  some  unknown 
time  u,  and  we  want  to  detect  such  a  change  as  soon  as  possible  if  it  occurs.  Note  that  we 
are  only  interested  in  detecting  a  change  in  the  risk  A„’s,  and  the  population  sizes  /„’s  can 
be  either  pre-specified  constants  or  observable  (possibly  dependent)  random  variables  whose 
distributions  are  nuisance  parameters  that  are  left  unspecified. 

To  construct  efficient  schemes,  it  is  natural  to  consider  the  family  of  generalized  like¬ 
lihood  ratio  (GLR)  schemes.  Note  that  the  problems  can  be  thought  of  as  testing  the 
null  hypothesis  Hq  :  u  =  oo  (no  change)  against  the  composite  alternative  hypothesis 
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Hi  \  1  <  V  <  oo  {a.  change  occurs),  the  logarithm  of  the  corresponding  GLR  statistic  of 
the  hrst  n  observations,  {(/i,  is  given  by 

Wn=  max  log-^^('(/i,Yi),---,(G,F„)y 
l<u<oo  droo  ^  ' 

Now  given  the  /j’s,  the  hi’s  are  conditionally  independent  with  a  conditional  probability 
density  function  (pdf)  fo{Yi\li)  =  {kXo)^^ / {Yi})  if  i  <  z/,  but  with  a  conditional  pdf 

fi{nk)  =  e-^^^^{kXiV^/{Y,\)  if*  >  z/.  Moreover,  the  distribution  of  the  /„’s  is  assumed  to  be 
the  same  under  Poo  or  and  for  the  hrst  n  observations,  {(h,  their  Pj,-distribution 

is  the  same  as  their  Poo-distribution  when  u  >  n,  due  to  the  uniqueness  of  the  pre-change 
distribution.  Hence,  the  logarithm  of  the  GLR  statistic  can  be  rewritten  as 


n 

max  y  log 

l<k<n-\-l  ^  ^ 
i=k 


umi) 

h{Y,\k) 


n 

max  >  Yi  log 

l<fc<n+l  ^  L 
i=k 


Ao 


k{Xi  —  Ao)  , 


(9) 


where  Y^^=n+i  ~  Thus,  under  our  setting,  the  GLR  scheme  raises  an  alarm  at  time 

TcLnid)  =  hrst  n  >  1  such  that  Wn  >  a,  (10) 


(=  cxD  if  such  n  does  not  exist),  where  the  constant  a  is  chosen  to  satisfy  the  false  alarm 
constraint.  For  the  purpose  of  online  implementation,  it  is  easy  to  see  that  Wn  in  (9)  enjoys 
a  recursive  formula  of  the  classical  GUSUM  statistics: 


Wn  =  max  0,  Wn-i  + 


Ynlog^-ln{Xi-Xo) 

Ao 


It  is  also  not  difficult  to  establish  the  asymptotic  optimality  properties  of  the  GLR  scheme 
under  the  classical  asymptotic  setting. 

So  far  we  have  “solved”  the  problem  by  our  favorite  GLR  methods,  but  perhaps  not 
solved  it  in  practice.  To  illustrate  that  despite  its  nice  asymptotic  optimality  properties,  the 
GLR  scheme  may  not  necessarily  be  as  ehective  as  one  expects  in  application,  we  propose 
two  ad-hoc  methods  for  comparison. 

Intuitively,  two  features  of  the  GLR  scheme  seem  questionable  in  the  context  of  non¬ 
stationary  population  sizes:  (i)  the  GLR  statistic  Wn  in  (9)  assigns  the  same  weight  to  the 
individual  log-likelihood  ratio  statistic  log  regardless  of  population  size  /j’s,  although 

the  Rj’s  with  larger  population  sizes  /j’s  surely  provide  more  information,  and  (ii)  the  GLR 
scheme  TcLR^d)  uses  the  constant  threshold  value  a  over  time.  Accordingly,  we  propose  two 
alternative  detection  schemes  to  take  into  account  the  effects  of  population  sizes. 

The  hrst  scheme  is  based  on  the  quasi-log-likelihood  ratio  statistics  that  normalize  each 
term  log  (gj  by  their  (conditional)  variances,  or  equivalently,  by  the  population 


sizes  /j’s  (up  to  a  constant).  This  leads  to  the  detection  statistic 


max 

l<fc<n+l 


fl(nik) 


max 

l<fc<n+l 


Ao 


(Ai  —  Ao) 


(11) 


Thus,  for  any  given  constant  b,  we  can  dehne  the  weighted  likelihood  ratio  (WLR)  scheme 

Twinib)  =  hrst  n  >  1  such  that  Wn  >  b.  (12) 


Another  motivation  for  the  WLR  scheme  Twinib)  in  (12)  is  based  on  Yn/ln,  a  natnral 
estimator  of  the  risk  or  the  disease  rate  per  100,000  population.  To  see  this,  if  we  pretend 
that  Yn/ln  is  Poisson  distribnted  with  mean  A„  (this  is  not  trne  nnder  onr  setting,  bnt  we  can 
still  use  it  to  construct  detection  schemes),  then  the  problem  becomes  the  classical  problem 
of  detecting  a  change  in  the  Poisson  mean  from  Aq  to  Ai,  and  the  corresponding  GLR  (or 
CUSUM)  procednre  is  just  the  WLR  scheme  Twinip)  in  (12). 

The  second  scheme  we  propose  is  to  use  the  GLR-based  statistic  Wn  in  (9),  but  with 
adaptive  thresholds  to  take  into  acconnt  population  size  effects.  Ideally,  one  wonld  like  to 
use  the  optimal  thresholds  or  boundaries,  say,  by  some  Bayesian  or  non-Bayesian  arguments, 
bnt  snch  bonndaries  seem  to  be  too  complicated  to  derive  explicitly.  For  simplicity,  we  use 
the  linear  boundaries:  InC.  Specihcally,  the  proposed  adaptive  threshold  method  (ATM) 
raises  an  alarm  at  time 


Tatm{c)  =  hrst  n  >  1  snch  that  Wn  >  InC,  (13) 

for  some  constant  c  >  0,  where  Wn  is  the  GLR  statistic  dehned  in  (9). 

It  is  important  to  point  ont  that  when  the  population  sizes  /„’s  are  eqnal  to  a  constant 
I  >  0,  then  the  three  detection  schemes,  TcLR{ci),TwLR{b)  and  Tatm{c),  not  only  are  eqniv- 
alent  (when  a  =  lb  =  Ic),  bnt  also  hold  the  exact  optimality  properties  of  Page’s  GUSUM 
procedures  for  the  i.i.d.  models.  However,  when  the  population  sizes  /„’s  vary,  these  three 
schemes  are  no  longer  equivalent. 

Indeed,  nnmerical  simnlation  stndies  illustrate  that  the  GLR  schemes  are  at  times  not  as 
efficient  as  two  families  of  ad-hoc  schemes  based  on  either  the  weighted  likelihood  ratios  or 
the  adaptive  threshold  method  that  adjust  the  effects  of  population  sizes.  Specihcally,  the 
GLR  scheme  is  the  best  scheme  with  the  smallest  worst-case  detection  delay  if  the  popnlation 
sizes  are  decreasing,  bnt  it  is  the  worst  scheme  if  the  popnlation  sizes  are  increasing;  the 
WLR  scheme  is  the  worse  scheme  if  the  population  sizes  are  decreasing,  bnt  the  best  if  the 
popnlation  sizes  are  increasing.  The  adaptive  threshold  scheme  Tatm{c)  seems  to  be  robust 
in  the  sense  of  small  detection  delays  Ei(T),  no  matter  whether  the  popnlation  sizes  are 
increasing  or  decreasing. 

In  other  words,  onr  simulations  suggest  that  when  one  has  prior  information  that  popu¬ 
lation  sizes  are  increasing  or  decreasing,  one  use  the  best  among  these  three  schemes.  When 
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there  is  uncertainty  about  the  trends  of  population  sizes,  one  may  want  to  use  the  adaptive 
threshold  scheme  Tatm{c)  to  take  advantage  of  its  robustness  properties.  In  particular,  de¬ 
spite  its  asymptotic  optimality  properties  under  the  classical  asymptotic  setting,  the  GLR 
scheme  indeed  can  perform  very  poorly  in  hnite-sample  numerical  simulations,  especially  in 
the  typical  scenarios  of  biosurveillance  when  the  population  sizes  are  increasing. 

To  explain  this,  in  Mei,  Han  and  Tsui  (2011),  we  develop  a  new  asymptotic  optimality 
analysis  under  a  new  asymptotic  setting  that  is  more  suitable  to  our  hnite-sample  numerical 
simulations.  In  addition,  we  extend  our  approaches  to  a  general  setting  with  arbitrary  prob¬ 
ability  distributions,  as  well  as  to  the  continuous-time  setting  involving  the  multiplicative 
intensity  models  for  Poisson  processes,  although  further  research  is  needed. 

2.3  Decentralized  Sequential  Multi-Hypothesis  Testing 

In  Wang  and  Mei  (2010),  we  investigate  decentralized  sequential  multi-hypothesis  testing 
problems.  In  recent  years,  the  decentralized  version  of  sequential  hypothesis  testing  problems 
has  gained  a  great  amount  of  attention  and  has  been  applied  into  a  wide  range  of  applications 
such  as  military  surveillance,  target  tracking  and  classihcation,  and  data  hltering.  Under  a 
widely  used  decentralized  setting,  raw  data  are  observed  at  a  set  of  geographically  deployed 
sensors,  whereas  the  hnal  decision  is  made  at  a  central  location,  often  called  the  fusion 
center.  The  key  feature  here  is  that  raw  observations  at  the  local  sensors  are  generally  not 
directly  accessible  by  the  fusion  center,  and  the  local  sensors  need  to  send  quantized  summary 
messages  (generally  belonging  to  a  hnite  alphabet  set)  to  the  fusion  center.  This  is  due  to 
limited  communication  bandwidth  and  requirements  of  high  communication  robustness. 

Unfortunately,  decentralized  sequential  hypothesis  testing  problems  are  very  challenging, 
and  to  the  best  of  our  knowledge,  existing  research  is  restricted  to  testing  two  simple  hypothe¬ 
ses,  and  it  has  been  an  open  problem  to  hnd  any  sort  of  asymptotically  optimal  solutions  for 
the  decentralized  sequential  testing  problem  when  testing  M  >  3  hypotheses.  This  is  not 
surprising,  because  even  in  the  centralized  version,  it  requires  sophisticated  mathematical 
and  statistical  techniques  and  only  asymptotic  optimality  results  are  available. 

The  goal  of  our  research  along  this  direction  is  to  tackle  this  open  problem  by  develop¬ 
ing  a  class  of  asymptotically  optimal  decentralized  sequential  procedures  for  testing  M  >  3 
hypotheses.  To  do  so,  a  major  challenge  we  need  to  overcome  is  hnding  the  “optimal  quan¬ 
tizers”  that  can  best  send  quantized  summary  sensor  messages  from  the  local  sensors  to 
the  fusion  center  so  as  to  lose  as  little  information  as  possible.  Intuitively,  such  a  quan¬ 
tizer  should  depend  on  the  true  distribution  of  the  raw  data,  which  is  unknown,  and  thus 
stationary  quantizers  are  generally  not  optimal.  In  addition,  since  a  quantizer  can  be  any 
measurable  function  as  long  as  its  range  is  in  the  given  hnite  alphabet  set,  it  resides  in  an 
inhnite  dimensional  functional  space.  Hence  it  is  essential  to  investigate  the  form  of  the 
“optimal  quantizers”  so  that  one  can  reduce  the  inhnite  dimensional  functional  space  to 
a  hnite-dimensional  parameter  space  for  the  purpose  of  theoretical  analysis  and  numerical 
computation.  Note  that  when  testing  M  =  2  hypotheses,  Tsitsiklis  {IEEE  Trans.  Commun., 
1993)  and  Veeravalli,  Basar,  and  Poor  {IEEE  Trans.  Inf.  Theory,  1993)  showed  that  the 
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optimal  quantizers  can  be  found  from  the  family  of  monotone  likelihood  ratio  quantizers 
(MLRQ),  whose  form  is  dehned  up  to  a  hnite  number  of  parameters.  Unfortunately,  such 
a  result  does  not  apply  to  the  case  of  testing  M  >  3  hypotheses.  To  find  the  form  of  the 
optimal  quantizers  for  multi-hypotheses,  we  combine  three  existing  methodologies  together: 

•  two-stage  tests  in  Stein  {Ann.  Math.  Statist..,  1945)  and  Kiefer  and  Sacks  {Ann.  Math. 
Statist.,  1963),  or  equivalently,  tandem  quantizers  in  Mei  {IEEE  Trans.  Inf.  Theory, 
2008); 

•  unambiguous  likelihood  guantizers  (ULQ)  in  Tsitsiklis  {IEEE  Trans.  Commun.,  1993); 
and 

•  randomized  guantizers,  see  Chernoff  {Ann.  Math.  Statist.,  1959)  for  a  closely  related 
topic  on  randomized  experiments. 

Mathematically,  we  assume  that  a  sensor  network  consists  of  K  local  sensors  labeled 
by  S^ ,  ...,  S^  and  a  fusion  center  which  makes  a  hnal  decision  when  stopping  taking 
observations.  At  each  time  step  n  =  1,  2, . . . ,  each  local  sensor  S^  observes  raw  data  {Xff\ 
and  sends  quantized  summary  messages  {Uf}  to  the  fusion  center.  Here  the  quantized 
messages  {Uf}  are  required  to  belong  to  a  hnite  alphabet,  say,  {0, 1, ...  ,/^  —  1},  due  to 
limited  communication  bandwidth  or  requirements  of  high  communication  robustness.  In 
other  words,  the  fusion  center  does  not  have  direct  access  to  the  raw  data  {X!^},  and  have  to 
utilize  the  quantized  sensor  messages  {Uf}  to  make  a  hnal  decision.  If  necessary,  the  fusion 
center  can  send  feedback  to  the  local  sensors  so  as  to  improve  the  system  efficiency.  To 
be  more  concrete,  we  further  assume  that  at  time  n,  for  each  k  =  1,2, . . . ,  K,  the  quantized 
sensor  message  at  the  kth  local  sensor  is  of  the  form 

Kf_,)  G  {0, 1, . . . ,  -  1}  (14) 

where  the  feedback  is  dehned  by 

(15) 

and  =  {Ui, . . . ,  Uf_i)  denotes  all  past  local  sensor  messages.  That  is,  the  quantizer 

0^  is  a  function  used  by  sensor  S^  to  map  the  local  raw  data  X^  into  {0, 1, . . . ,  —  1},  and 

the  choice  of  0^  can  depend  on  the  feedback  and  can  be  a  randomized  function  (to  be 
discussed  later). 

In  decentralized  sequential  multihypothesis  testing  problems,  there  are  M  hypotheses 
regarding  the  distribution  P  of  the  raw  data  {Xff\\ 

Um  P  =  Pm,  m  =  0, 1, . . .  ,M  -  1.  (16) 

Under  each  P^,  the  raw  data  X!^  at  local  sensor  S^  are  i.i.d.  with  density  /m(-)  with  respect 
to  a  common  underlying  measure,  and  the  raw  data  {Xff\  are  assumed  to  be  independent 
among  diherent  sensors.  Hence  the  distributions  of  the  raw  data  under  P^  are  completely 
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determined  by  the  K  densities:  ,  f^.  Below  we  simply  state  that  the  true  state  of 

nature  is  m  or  if  the  hypothesis  is  true. 

A  decentralized  sequential  test  S  consists  of  a  rule  to  determine  the  sensor  messages,  a 
stopping  time  N  used  by  the  fusion  center  and  a  final  decision  rule  D  G  {0, 1, . . . ,  M  —  1}  that 
chooses  one  of  the  M  probability  measures  Pm’s  based  on  the  information  up  to  time  N  at 
the  fusion  center.  In  a  Bayesian  framework,  we  assign  prior  probabilities  n  =  (tto,  . . . ,  tim-i) 
to  the  M  hypotheses  Hq,  •  •  • ,  Let  c  >  0  be  the  cost  per  time  step  until  stopping, 

and  let  W{m,m')  be  the  loss  of  making  decision  D  =  m'  when  the  true  state  is  P^-  It 
is  standard  to  assume  that  W{m,m)  =  0  but  W{m,m')  >  0  for  any  m  ^  m',  i.e.,  no  loss 
occurs  if  and  only  if  a  correct  decision  is  made.  Then  when  the  true  state  of  nature  is  Pm, 
the  total  expected  cost  of  a  decentralized  test  6  is 

7^e(^;  m)  =  cEm(iV)  +  W{m,  m')P^{D  =  m'} 

m' 


where  Em  is  the  expectation  operator  under  Pm.  Hence,  the  Bayes  risk  of  the  decentralized 
test  5  is 

7^e(5)  =  5^7^m7^,(^;m).  (17) 

In  our  research,  we  develop  asymptotically  Bayes  procedures  by  introducing  a  class  of 
“two-stage”  decentralized  sequential  tests  in  which  each  local  sensor  uses  two  stationary 
(possibly  randomized)  local  quantizers  with  at  most  one  switch  between  these  two  quantizers. 
This  type  of  tests  are  useful  because  they  allows  the  fusion  center  to  hrst  make  a  preliminary 
guess  about  the  true  state  of  nature  and  then  optimize  the  procedure  accordingly. 

Our  proposed  two-stage  test  (5(c)  can  be  dehned  as  follows.  In  the  first  stage  of  (5(c),  the 
local  sensor  can  use  any  “reasonable”  stationary  deterministic  quantizer  and  the  fusion  center 
needs  to  make  a  preliminary  guess  about  the  true  state  of  nature.  The  only  requirement  is 
that  as  the  cost  c  — )■  0,  the  probabilities  of  making  incorrect  preliminary  guess  go  to  zero 
but  the  time  steps  taken  at  this  first  stage  become  negligible  as  compared  to  those  of  the 
overall  procedure  (or  the  second  stage). 

To  be  more  concrete,  let  u{c)  G  (0,1/2)  be  a  function  of  c  such  that  u{c)  — )■  0  and 
logM(c)/logc  — )■  0  when  c  — )■  0,  e.g.,  u{c)  =  1/|  logc|.  Choose  a  deterministic  quantizer 
such  that  I{m,  m';  0°)  >  0  for  any  two  states  0<m^m'<M  —  1,  and  let  the  local  sensor 
use  the  stationary  quantizer  0°  to  send  i.i.d.  sensor  messages  Un  =  0°(X„)  to  the  fusion 
center.  Then  the  fusion  center  faces  a  classical  sequential  detection  problem  with  the  i.i.d. 
sensor  messages  Ufis  as  inputs,  and  thus  it  is  intuitively  appealing  to  make  a  preliminary 
decision  based  on  posterior  distributions.  Specihcally,  at  each  time  step  n  =  0, 1,  •  •  • ,  the 
fusion  center  updates  recursively  the  posterior  distribution  (7ro,n,  TTpn,  •  •  • ,  T^M-i,n)  as  follows: 


^m,n 


'^m,n—lfm(^Ufij  (j) 


E 


0<m'<M 


_1  '^m' 0*^) 


Then  the  fusion  center  will  stop  the  hrst  stage  at  time  step 


A"o  =  minjn  >  0  :  max  {nmn}  >  1  —  'w(c)| 

0<m<Ar-l 
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and  when  stopped,  the  fusion  center  makes  a  preliminary  decision 

Dq  =  arg  max  'Km,No- 

0<m<M-l 

Note  that  the  preliminary  decision  Dq  is  well-dehned  because  the  maximum  value  of  Tim, No 
is  attained  at  only  one  index  m  due  to  the  dehnition  of  iVo  and  the  fact  that  u{c)  <1/2.  For 
the  purpose  of  practical  implementation,  the  preliminary  decision  Dq  can  be  transmitted  to 
the  local  sensor  through  a  feedback  of  log2  M  bits. 

In  the  second  stage  of  our  proposed  test  d(c),  the  local  sensor  will  switch  to  another 
stationary  (likely  randomized)  quantizer  that  may  depend  on  the  preliminary  decision  Dq. 
Without  loss  of  generality,  we  assume  that  the  local  sensor  uses  the  stationary  quantizer  (1)^ 
when  the  preliminary  decision  at  the  hrst  stage  is  Dq  =  m  for  m  =  0,l,...,M  —  1.  Here  we 
put  a  bar  over  (pm  to  emphasize  that  it  is  likely  a  randomized  quantizer  when  optimized,  and 
we  will  postpone  the  detailed  discussion  about  how  to  implement  randomized  quantizers  to 
the  next  subsection. 

Now  at  the  second  stage,  the  fusion  center  shall  ignore  the  preliminary  decision  Dq  and 
continue  to  update  the  posterior  distribution  {nQ^n,  ■  ■  ■  ,7iM-i,n)  with  the  sensor  messages 
generated  from  the  new  quantizer  (pm  when  Dq  =  m  (how  to  update  will  be  discussed  in 
the  next  subsection).  Then  the  fusion  center  will  stop  the  second  stage  (hence  the  whole 
procedure)  at  time  step 

N  =  minjn  >  Nq  :  max  {Tr^nl  >  1  —  c| 

and  when  stopped,  the  fusion  center  makes  a  hnal  decision 

D  =  arq  max 

0<m<M— 1  ’ 

In  Wang  and  Mei  (2010),  the  asymptotic  optimality  properties  of  our  proposed  test  are 
established.  During  the  process,  a  spin-off  project  is  to  investigate  the  effect  of  quantization 
on  the  second  or  higher  order  moments  of  log-likelihood  ratios.  The  well-known  Kullback- 
Leibler’s  inequality  states  that  quantization  cannot  increase  Kullback-Leibler  information 
number  which  is  just  the  hrst  moment  of  of  log-likelihood  ratios.  In  our  latest  manuscript, 
Wang  and  Mei  (2010)  (see  appendix  4.2),  we  extend  the  Kullback-Leibler’s  inequality  to  show 
that  quantization  may  result  in  an  increase  of  the  second  or  higher  order  moments  of  log- 
likelihood  ratios,  but  such  an  increase  is  bounded  by  a  universal  constant  that  only  depends 
on  the  value  of  the  moment  (such  a  constant  is  2/e  for  the  second  moment).  Furthermore, 
we  also  illustrate  that  the  result  can  be  used  to  provide  a  simpler  sufficient  condition  for  the 
asymptotic  optimality  theory  in  decentralized  sequential  detection  problems. 

2.4  The  2-CUSUM  Model  for  Two-Choice  Reaction  Time 

The  mathematical  models  have  played  an  important  roles  in  psychology  and  behavior  sci¬ 
ences,  and  one  important  example  is  Ratcliff’s  diffusion  model  that  has  been  successfully 
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applied  to  account  for  speed-accuracy  relations,  mean  response  times  and  the  shapes  of 
reaction-time  distributions  for  the  paradigm  of  two  alternative  forced  choice,  i.e.,  when  the 
subject  in  the  experiment  has  to  choose  one  of  two  decisions,  e.g.,  “Signal  is  in  Position  1”  or 
“Signal  is  in  Position  2.”  However,  when  the  subject  in  the  two-choice  experiment  is  allowed 
to  say  “No  signal,”  it  is  unclear  how  to  “best”  extend  Ratcliff’s  diffusion  model  so  that  the 
corresponding  mathematical  model  can  take  into  account  of  uncertainty  about  the  time  of 
occurrence  of  the  signal. 

In  our  research,  spurred  by  the  experiments  with  uncertainty  about  the  time  of  occurrence 
of  the  signal,  we  extend  Ratcliff’s  diffusion  model  to  the  2-CUSUM  model  by  emphasizing 
natural  connections  to  the  sequential  change-point  detection  problems.  The  proposed  2- 
CUSUM  model  is  based  on  the  optimal  procedure  in  the  problem  of  detecting  a  change  in 
the  mean  of  Brownian  motion  from  /i  =  0  (no  signals)  to  /i  =  ±2A  (a  signal  appears). 

To  state  more  rigorously,  in  Ratcliff’s  diffusion  model,  it  is  assumed  that  each  subject 
in  an  experiment  has  an  internal  (real-valued)  process  that  summarizes  the  information 
accumulation  over  time.  The  process  runs  between  two  thresholds  and  is  terminated  as 
soon  as  one  of  the  thresholds  is  crossed,  and  the  subject  will  choose  one  of  two  decisions 
depending  on  which  threshold  is  crossed.  In  order  for  the  model  to  be  flexible  to  describe 
the  data  in  real-world  applications,  several  parameters  in  the  internal  process  are  assumed 
to  be  random  effects  to  account  for  intra-trial  variability  or  subject-specihc  effects.  The  key 
mathematical  tool  is  based  on  the  optimality  of  Wald’s  sequential  probability  ratio  tests 
(SPRT).  Mathematically,  in  the  diffusion  model,  the  response  time  {RT)  is  given  by 

RT  =  d  +  Ter, 

where  Ter  ~  Uniform(S't)  and  the  decision  time  d  is  dehned  as  a  SPRT-type  stopping  time 

d  =  inf{t  >  0  :  2;  -T  ^  (0,  a)},  (18) 

where  Zt  =  vt  +  sWt,  {HA;  t  >  0}  is  a  standard  Brownian  motion,  the  starting  point  2:  ~ 
Uniform(S'2),  the  standard  deviation  s  is  a  constant,  and  the  drift  rate  v  ~  N{vo,ri‘^).  Here 
the  drift  rate  v  denotes  the  rate  of  accumulation  of  information  and  it  is  determined  by  the 
quality  of  the  information  extracted  from  the  stimulus.  In  an  experiment,  the  value  of  drift 
rate  would  be  different  for  each  stimulus  condition  that  differed  in  difficulty. 

Our  proposed  2-CUSUM  process  model  is  as  follows.  For  a  hxed  A  >  0,  for  any  t  >  0, 
consider  two  process 

Ro,t  =  —At  +  Zt  +  {z  —  \)  and  Ri^t  =  —At  —  Zt  +  {a  —  z  —  X).  (19) 

For  /  =  1,  2,  dehne  the  corresponding  CUSUM  processes 

HA,t  = -R/,t  -  min(0,  inf  R/  J,  (20) 

and  the  Page’s  CUSUM  procedures 


d*i  =  inf{t  >  0  :  HA,t  >  a}. 
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Then  the  decision  time  is  dehned  as 


d*  =  dl) .  (21) 

There  are  two  new  features  in  the  proposed  2-CUSUM  process  models.  The  hrst  one 
is  information  discounting  in  the  sense  of  adding  a  negative  term  —\t  to  both  cumulative 
information  processes,  i?o,t  and  Ri^f  The  second  is  memory  forgetting  in  the  sense  that  the 
decision  makers  will  “forget”  or  “ignore”  those  “not  so  signihcant”  information  that  is  not 
consistent  with  their  prior  information,  and  will  start  afresh  to  cumulative  information  until 
there  is  sufficient  evident  to  prove  that  the  prior  information  is  incorrect.  Moreover,  the 
discounting  information  rate  A  can  also  be  thought  of  the  detection  limit  for  individuals,  i.e., 
human  being  essentially  ignore  those  information  whose  rate  is  less  than  A  in  their  decision 
making  statistics  Wi^fs.  In  Moutakides  and  Mei  (2010)  (see  appendix  4.3),  we  investigate  the 
properties  of  d*  in  (21),  and  show  that  d*  is  equivalent  to  d  in  (18)  when  A  =  0.  Furthermore, 
when  A  >  0,  d*  is  asymptotically  equivalent  to  d  (in  the  sense  that  both  are  asymptotically 
normally  distributed)  when  the  drift  rate  v  ^  [—A,  A],  but  they  have  completely  distributions 
when  the  drift  rate  v  G  [—A,  A]. 
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MONITORING  A  LARGE  NUMBER  OF  DATA  STREAMS 

VIA  THRESHOLDING 

By  Yajun  Mei*  , 

Georgia  Institute  of  Technology 

The  sequential  change-point  detection  problem  is  studied  in  a 
general  context  of  monitoring  a  large  number  of  data  streams  when 
the  trigger  event  may  affect  different  data  streams  differently,  e.g., 
the  subset  of  affected  data  stream  is  unknown  and  the  event  could 
have  an  immediate  or  delayed  impact  on  affected  data  streams.  Mo¬ 
tivated  by  the  scenario  when  one  has  a  prior  knowledge  that  at  most 
r  data  streams  will  be  affected  and  by  the  applications  in  censor¬ 
ing  sensor  networks,  we  propose  scalable  global  monitoring  schemes 
based  on  the  sum  of  those  local  CUSUM  statistics  that  are  “large” 
under  either  hard  thresholding  or  top-r  thresholding  rules  or  both. 

The  proposed  schemes  are  shown  to  possess  certain  asymptotic  op¬ 
timality  properties  in  the  simplest  model,  and  extensions  to  more 
complicated  models  are  discussed. 

1.  Introduction.  Sequential  change-point  detection,  or  more  generally, 
sequential  methodologies,  have  a  variety  of  applications  such  as  industrial 
quality  control,  signal  detection  and  biosurveillance.  The  classical  version 
of  this  problem,  where  one  monitors  a  single  univariate  independent  and 
identically  distributed  (i.i.d.)  data  stream,  is  a  well-developed  area,  and 
many  classical  schemes  have  been  developed  such  as  the  Shewhart’s  chart 
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(Shewhart  [24]),  moving  average  control  charts,  Page’s  CUSUM  procedure 
(Page  [18]),  Shiryaev-Roberts  procedure  (Shiryaev  [25],  Roberts  [23])  and 
scan  statistics.  All  these  classical  schemes  not  only  hold  attractive  theoretical 
properties,  but  also  are  computationally  simple.  See,  for  example,  Lorden 
[11],  Poliak  [19,  20],  Moustakides  [16],  Lai  [9,  10],  Kulldorff  [8]. 

In  the  modern  information  age,  one  often  faces  the  need  to  monitor  a 
large  number  of  data  streams  with  the  aim  of  offering  the  potential  for  early 
detection  of  a  “trigger”  event.  In  the  literature,  two  special  categories  of 
global  monitoring  have  gained  a  lot  of  attention:  one  is  consensus  detection 
in  which  the  occurring  event  is  assumed  to  affect  all  data  streams  abruptly 
and  simultaneously,  and  the  other  is  the  case  when  one  and  only  one  data 
stream  is  known  to  be  affected  by  the  occurring  event,  e.g.,  multi-channel 
detection  in  signal  detection.  See,  for  example,  Montgomery  [15],  Veeravalli 
[31],  Tartakovsky  and  Veeravalli  [28]  and  Tartakovsky  et  al.  [29]. 

In  this  article,  we  consider  a  general  scenario  when  the  occurring  event 
may  affect  different  data  streams  differently.  For  instance,  we  may  know  that 
the  event  can  affect  10  out  of  100  data  streams  but  we  do  not  know  which  10 
data  streams  will  be  affected.  In  addition,  the  event  could  have  an  immediate 
or  delayed  impact  on  affected  data  streams.  One  example  is  biosurveillance 
when  a  disease  outbreak  may  first  occur  in  some  unknown  local  cities  or 
counties  and  then  spread  out  to  other  regions.  Another  example  occurs  in 
target  detection  when  one  uses  several  different  sensor  types  such  as  radar, 
sonar,  infrared  and  magnetic  to  detect  a  target,  and  it  is  possible  that  some 
(but  not  all)  sensors  can  provide  earlier  information  about  the  target. 

Ideally,  one  would  like  to  develop  a  single  global  monitoring  scheme  that 
can  detect  the  occurring  event  as  quickly  as  possible  while  controlling  the 
system-wise  global  false  alarm  rate.  While  such  global  monitoring  schemes 


can  be  found  by  the  standard  statistical  methods  such  as  generalized  likeli¬ 
hood  ratios  or  mixture  likelihood  ratios  from  the  theoretical  viewpoint,  they 
are  generally  infeasible  from  the  computational  viewpoint  since  they  would 
require  us  to  do  an  exhaustive  search  over  all  possible  post-change  hypothe¬ 
ses  or  over  all  possible  combinations  of  affected  data  streams.  Hence,  it  is 
desirable  to  find  a  global  monitoring  scheme  that  is  not  only  efficient,  but 
also  scalable  in  the  sense  of  being  able  to  be  implemented  to  monitor  a  large 
number  of  data  streams  over  a  long  period  of  time. 

To  develop  a  scalable  global  scheme,  one  intuitively  appealing  approach 
is  to  monitor  each  local  data  stream  locally  through  some  classical  compu¬ 
tationally  efficient  schemes  and  then  combine  all  local  schemes  together  to 
produce  a  single  global  scheme.  Unfortunately,  little  research  has  been  done 
on  how  to  best  combine  all  local  schemes  together  to  ensure  efficiency,  and  to 
the  best  of  our  knowledge,  only  two  existing  approaches  have  been  proposed 
so  far.  The  first  one  is  a  naive  approach  that  raises  an  alarm  at  the  global 
level  whenever  any  local  scheme  raises  a  local  alarm.  This  naive  approach  is 
very  effective  if  one  or  very  few  data  streams  are  affected,  but  it  may  lose 
efficiency  when  several  or  more  data  streams  can  provide  information  to  the 
occurring  event.  For  the  purpose  of  comparison,  below  the  global  scheme 
corresponding  to  this  naive  approach  will  be  referred  as  a  “MAX”  scheme, 
since  it  can  be  thought  of  raising  a  global  alarm  if  the  maximum  of  local 
monitoring  statistics  is  too  large,  where,  if  necessary,  one  can  normalize  the 
local  schemes  so  that  the  local  detection  thresholds  are  the  same.  Recently, 
in  Mei  [14] ,  the  present  author  proposes  to  raise  an  alarm  at  the  global  level 
if  the  sum  of  local  monitoring  statistics  (e.g.,  local  CUSUM  statistics  in  the 
logarithm  scale)  is  too  large,  and  theoretical  analysis  and  numerical  sim¬ 
ulations  show  that  the  corresponding  “SUM”  scheme  is  efficient  when  the 


number  of  affected  data  streams  is  very  large. 

The  objective  of  this  article  is  to  illustrate  that  there  are  other  simple 
approaches  to  combine  all  local  schemes  together  to  produce  an  efficient 
scalable  global  scheme.  Our  proposed  methodologies  are  motivated  by  the 
following  two  applications.  The  first  one  is  the  scenario  in  which  one  has  a 
prior  knowledge  that  at  most  r  out  of  K  data  streams  will  be  affected  by  the 
occurring  event,  especially  when  r  is  neither  too  small  nor  too  large,  e.g.,  r  = 
10  and  K  =  100.  This  scenario  may  be  defined  by  the  network  fault-tolerant 
design  to  avoid  risking  failure.  Unfortunately,  the  existing  “MAX”  or  “SUM” 
schemes  can  be  ineffective  under  this  scenario,  and  new  methodologies  are 
needed  to  take  advantage  of  such  a  prior  knowledge  in  globally  monitoring. 

The  second  motivation  of  our  proposed  methodologies  is  censoring  sensor 
networks  in  engineering,  which  was  introduced  by  Rago  et  al.  [22]  and  later 
by  Appadwedula  et  al.  [1]  and  by  Tay  et  al.  [30].  Figure  1  illustrates  the 
general  setting  of  a  widely  used  configuration  of  censoring  sensor  networks,  in 
which  the  data  streams  X^^^’s  are  observed  at  the  remote  sensors  (typically 
low-cost  battery-powered  devices) ,  but  the  final  decision  is  made  at  a  central 
location,  called  the  fusion  center.  The  key  feature  of  such  a  network  is  that 
while  sensing  (i.e.,  taking  observations  at  the  local  sensors)  are  generally 
cheap  and  affordable,  communication  between  remote  sensors  and  fusion 
center  are  expensive  in  terms  of  both  energy  and  limited  bandwidth.  Thus,  to 
prolong  the  reliability  and  lifetime  of  the  network  system,  practitioners  often 
allow  the  local  sensors  to  send  summary  messages  Uk,n^  to  the  fusion  center 
only  when  necessary.  The  question  then  becomes  when  and  how  to  send 
summary  messages  so  that  the  fusion  center  can  still  monitor  the  network 
system  effectively. 

The  above  two  considerations  motivate  us  to  propose  to  raise  a  global 
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Fig  1 .  General  setting  of  a  widely  used  eonfiguration  of  censoring  sensor  networks. 


alarm  based  on  the  sum  of  those  local  detection  statistics  (e.g.,  local  CUSUM 
statistics)  that  are  “large”  under  either  top-r  thresholding  rules  or  hard 
thresholding  or  both.  It  turns  out  that  the  corresponding  global  schemes 
include  both  MAX  and  SUM  schemes  as  special  cases  and  possess  certain 
asymptotic  optimality  properties  under  proper  criteria.  It  is  worth  point¬ 
ing  out  that  a  well-known  view  in  the  standard  off-line  statistical  inference 
literature  is  the  necessity  of  thresholding  for  high-dimensional  data  in  or¬ 
der  to  improve  power  or  efficiency.  Thus,  from  the  methodology  point  of 
view,  our  proposed  methodologies  are  analogous  to  those  off-line  statistical 
methods  such  as  (adaptive)  truncation,  and  soft  and  hard  thresholding,  see 
Neyman  [17],  Donoho  and  Johnstone  [4],  Fan  and  Lin  [6].  Also  see  Candes 
[3]  and  the  references  there.  However,  our  motivation  here  is  different  and 
our  application  to  sequential  change-point  detection  is  new. 

The  remainder  of  this  article  is  organized  as  follows.  In  Section  2  we 
present  a  rigorous  mathematical  formulation  of  sequential  change-point  de¬ 
tection  problems  in  the  context  of  globally  monitoring  multiple  data  streams 
and  also  discuss  existing  methodologies.  Our  proposed  methodologies  are  de¬ 
fined  in  Section  3  for  the  simplest  model.  Section  4  reports  numerical  Monte 
Carlo  simulation  results,  and  Section  5  presents  an  asymptotic  optimality 


theory.  Section  6  extends  our  methodologies  to  more  complicated  models. 
The  appendices  include  the  proofs  of  our  main  theorems,  Theorems  5.1  and 

5.2. 

2.  Problem  Formulation  and  Existing  Methodologies.  To  illus¬ 
trate  our  main  ideas,  we  begin  with  the  simplest  independent  model  in  which 
the  pre-change  and  post-change  distributions  are  completely  specified,  and 
more  complicated  models  will  be  discussed  in  Section  6.  To  be  more  specific, 
assume  that  in  a  system,  one  is  monitoring  K  data  streams  over  time  n,  say, 
{Xk,n}^=i  for  k  =  Initially,  the  system  is  “in  control”  and  the 

distribution  of  the  X^^nS  is  fk  for  the  /c-th  data  stream.  At  some  unknown 
time  n,  a  “trigger”  event  occurs  and  affects  the  system  in  the  sense  that 
the  density  function  of  the  observations  Xk^n’s  changes  from  fk  to  another 
given  density  gk  at  time  Vk  >  f.  Without  loss  of  generality,  we  assume  that 
mini<fc<i^  Vk  =  v,  since  otherwise  the  system  is  affected  by  the  event  only 
after  the  time  v'  =  mini<fc<xr'fc,  which  can  be  treated  as  the  new  change- 
point.  It  is  desired  to  utilize  the  observed  data  streams  Xk^nS  to  raise  an 
alarm  at  the  global  level  as  soon  as  the  event  occurs  so  that  one  can  take 
appropriate  action. 

A  standard  minimax  formulation  is  to  minimize  the  detection  delay  sub¬ 
ject  to  a  false  alarm  constraint.  The  latter  is  typical  of  the  form 

(2.1)  E(°°)(T)>7, 

where  T  is  the  stopping  rule  associated  with  the  detection  scheme,  7  >  0  is 
a  pre-specified  constant,  and  e(°°^  denote  the  expectation  when  there  are  no 
changes.  In  the  literature  is  often  called  as  the  average  run  length 

to  false  alarm.  The  definition  of  detection  delay  is  a  little  more  complicated, 
as  it  needs  to  take  into  account  the  uncertainty  of  the  change-point  zz.  One 


widely  used  definition  is  the  following  “worst  case”  detection  delay  defined 
in  Lorden  [11], 

(2.2)  D(r)=  sup  esssupE('^)((r- zy  + 1)+ 

l<u<oo  ^  ' 

where  x~^  =  max(x,0),  Jv_i  =  •  •  • , denotes  past 

global  information  at  time  i',  and  ■  ■  ■ ,  Xk^u-i)  is  past 

local  information  for  the  /c-th  data  stream.  Note  that  in  the  definition 

(2.2)  is  obscure  as  it  does  not  takes  into  account  different  z^^’s  for  different 
data  streams,  see  (5.2)  below  for  a  more  precise  definition.  It  is  often  but 
not  always  that  the  worst  case  detection  delay  of  a  scheme  is  attained  when 
the  change-point  1^  =  1,  since  it  is  generally  more  difficult  to  detect  when  a 
change  occurs  at  earlier  stages  rather  than  at  latter  stages. 

Before  discussing  the  existing  schemes  for  globally  monitoring,  let  us  con¬ 
sider  the  local  schemes  for  monitoring  a  single  data  stream,  say,  the  kth  data 
stream.  Such  a  problem  has  been  well  studied  in  the  sequential  change-point 
detection  literature,  and  one  efficient  local  scheme  is  Page’s  CUSUM  proce¬ 
dure  which  raise  a  local  alarm  as  soon  as  the  local  CUSUM  statistic  exceeds 
some  pre-specified  threshold,  where  the  local  CUSUM  statistic  for  the  fcth 
data  stream  at  time  n  is  given  by 

In  practice,  the  local  CUSUM  statistic  Wk^n  can  be  computed  conveniently 
online  via  a  recursive  formula: 

(2.3)  Wk,n  =  max  (Wk,n-i  +  log  ’ 

and  114,0  =  0- 

Now  let  us  go  back  our  global  monitoring  problem.  As  mentioned  in  the  in¬ 
troduction,  two  approaches  have  been  proposed  to  combine  all  local  schemes 


together  to  develop  a  single  scalable  global  scheme.  The  first  one  is  the 
“MAX”  scheme  that  raises  an  alarm  at  the  global  level  as  soon  as  one  of 
local  CUSUM  procedures  raises  a  local  alarm.  Mathematically,  the  “MAX” 
scheme  raises  a  global  alarm  at  time 

(2.4)  Tmax(c)  =  first  time  n  >  1  such  that  max  Whn  >  c, 

l<k<K 

(=  oo  if  such  n  does  not  exist)  where  c  >  0  is  a  pre-specified  constant 
chosen  to  satisfy  the  false  alarm  constraint  (2.1).  The  second  approach  is 
the  “SUM”  scheme,  proposed  in  Mei  [14],  in  which  one  raises  an  alarm  when 
the  sum  of  local  CUSUM  statistics  is  too  large.  Specifically,  at  time  n,  each 
data  streams  calculates  its  local  CUSUM  statistic  114, n’s  as  in  (2.3),  and 
then  one  will  raise  an  alarm  at  the  global  level  at  time 

K 

(2.5)  Tsiim  id)  =  first  time  n  >  1  such  that  ^  Wk^n  >  d, 

k=l 

(=  oo  if  such  n  does  not  exist)  where  the  constant  d  >  0  is  some  suitably 
chosen  constant.  The  intuition  behind  the  “SUM”  rule  is  that  a  large  value 
of  local  CUSUM  statistic  indicates  a  possible  local  change,  and  thus  the  sum 
of  local  CUSUM  statistics  is  a  reasonable  detection  statistic  if  the  number 
of  affected  data  streams  is  large  or  completely  unknown.  Intuitively,  the 
“MAX”  scheme  Tmax(c)  in  (2.4)  works  better  when  one  or  very  few  data 
streams  are  affected,  whereas  the  “SUM”  scheme  rsuin(d)  in  (2.5)  works 
better  when  many  data  streams  are  affected,  and  numerical  simulations  in 
Mei  [14]  indeed  verified  this  intuition. 

3.  The  Proposed  Methodologies.  In  this  section,  we  propose  to  de¬ 
velop  global  monitoring  schemes  via  top-r  thresholding  and/or  hard  thresh¬ 
olding.  It  turns  out  that  the  proposed  schemes  integrate  the  existing  “MAX” 


and  “SUM”  schemes,  and  offer  new  methodologies  for  online  monitoring  a 
large  number  of  data  streams. 


3.1.  Top-r  Thresholding  Schemes.  In  some  applications,  one  may  have 

a  prior  knowledge  that  at  most  r  out  of  K  data  streams  will  be  affected 
by  the  occurring  event.  This  inspires  us  to  construct  the  global  monitoring 
schemes  based  on  the  largest  r  local  CUSUM  statistics.  To  be  more  concrete, 
at  time  n,  order  the  K  local  CUSUM  statistics  Wi^n,  ■  ■  ■ ,  WK,n  from  largest 
to  smallest:  >  lU(2),n  >  •  •  •  >  and  then  the  top-r  thresholding 

scheme  raises  a  global  alarm  at  time 

r 

(3.1)  Ntop,r{o-)  =  first  time  n  >  1  such  that  E  "'(tin  >  «■ 

k=l 

Obviously,  the  top-r  thresholding  scheme  Ntop,r{a)  becomes  the  “MAX” 
scheme  when  r  =  1,  and  becomes  the  “SUM”  scheme  Tsum(d)  when 

r  =  K.  Intuitively,  when  the  occurring  event  affects  (at  most)  r  out  of  K 
data  streams,  then  (at  most)  r  local  CUSUM  statistics  will  be  significantly 
large  to  provide  information  to  the  occurring  event,  and  thus  one  should 
expect  Ntop,r{o)  to  be  suitable  and  effective  to  detect  such  a  scenario. 

3.2.  Hard  Thresholding  Schemes.  In  censoring  sensor  networks  illustrated 
in  Figure  1,  the  local  sensors  need  to  summarize  the  information  and  only 
send  significant  information  to  the  fusion  center  to  prolong  the  reliability 
and  lifetime  of  the  network  system.  This  inspires  us  to  propose  to  transmit 
only  those  local  CUSUM  statistics  Wk,n^s  that  are  larger  than  their  re¬ 
spective  local  thresholds,  and  the  corresponding  scheme  will  be  called  hard 
thresholding  schemes. 

Specifically,  at  time  n,  each  local  sensor  (data  stream)  calculates  its  local 
CUSUM  statistic  Wk^n  recursively  as  in  (2.3).  Then  at  time  n,  the  sensor 


message  from  the  sensor  to  the  fusion  center  is  given  by 


I  Wk,n,  if  Wk,n  >  bk 

[  NULL,  if  Wk,n  <  bk 

where  bk  >  0  is  the  local  censoring  (hard  threshold)  parameter  at  the  k-th. 
sensor  (or  data  stream).  Here  the  message  “NULL”  is  a  special  sensor  symbol 
to  indicate  the  local  CUSUM  statistic  is  not  large.  In  practice,  “NULL”  could 
be  represented  by  the  situation  when  the  sensor  does  not  send  any  messages 
to  the  fusion  center,  e.g.,  the  sensor  is  silent. 

After  receiving  the  local  sensor  messages  Uk^n’s  from  the  sensors,  the 
fusion  center  then  raises  an  alarm  at  the  global  level  at  hrst  time  n  such 


that 


received 

where  the  detection  threshold  a  >  0  is  a  pre-specified  constant. 

So  far  we  simply  follow  our  intuition  without  discussing  how  to  choose 
the  local  censoring  parameters  ft^’s,  especially  when  the  data  streams  are 
nonhomogeneous.  It  turns  out  that  a  “good”  choice  of  the  bkS  is  bk  =  Pkb 
for  k  =  1, ...  ,K  for  some  constant  6  >  0  where 

kdki  fk) 


(3.2) 


Pk  = 


J2k=il{9k,fk) 


can  be  thought  of  as  the  weight  of  the  A:-th  data  stream  in  the  overall  final 
decision,  and  lipk^fk)  is  the  Kullback-Leibler  information  number  defined 

by 


(3.3)  HokJk)  =  [  log^Y^gk{x)dp{x). 

J  Jk{x) 

With  this  choice  of  b^s,  the  proposed  hard  thresholding  scheme  will  raise  a 
global  alarm  at  time 

K 

(3.4)  Nhardia,  b)  =  first  n  >  I  such  that  ^  Wk,nl{'^k,n  >  Pkb}  >  a, 

k=l 


where  a  >  0  and  6  >  0  are  two  suitably  chosen  constants. 

Evidently,  if  the  threshold  parameter  6  =  0,  then  the  hard  thresholding 
scheme  Nhard{a,  b)  in  (3.4)  becomes  the  “SUM”  scheme  Ts,im(ii)  in  (2.5)  since 
the  Wk,n’s  are  always  non-negative  by  the  definition  in  (2.3).  On  the  other 
hand,  if  the  threshold  parameter  b  is  very  large,  say  b  >  a/  Pk, 

then  the  hard  thresholding  scheme  Nhard{(k^  b)  in  (3.4)  becomes  the  “MAX” 
scheme  rmax(c)  in  (2.4)  with  c  =  mmk{pkb).  Therefore,  the  family  of  schemes 
Nhardic^jb)  is  actually  a  very  large  family  that  includes  both  “MAX”  and 
“SUM”  schemes. 

It  is  interesting  to  note  that  the  proposed  hard  thresholding  and  top- 
r  thresholding  schemes  are  closely  related  to  the  concept  of  “censoring” 
in  engineering,  statistics,  and  medical  science,  particularly  reliability  and 
survival  analysis.  Specifically,  the  hard  thresholding  scheme  is  in  parallel  to 
the  so-called  “Type  I  censoring”  in  which  an  experimenter  has  a  set  number 
of  subjects  or  items  and  stops  the  experiment  at  a  pre-determined  time, 
whereas  the  top-r  thresholding  scheme  is  in  parallel  to  the  so-called  “Type  II 
censoring”  in  which  one  stops  the  experiment  when  a  predetermined  number 
of  subjects  are  observed  to  have  certain  properties.  In  addition,  the  top-r 
thresholding  schemes  can  also  be  thought  of  hard-thresholding  rules  where 
the  local  censoring  parameters  are  data-driven  and  adaptive  over  time  n. 
Specifically,  let  Wr,n  denote  the  r-th  largest  statistics,  i.e.,  Wr,n  =  lU(r),n) 
and  then  one  raises  an  alarm  at  the  global  level  at  the  time 

K 

N^gp  ria)  =  first  n  >  1  such  that  E  \y^k,n  —  ^r,n}  ^ 

k=l 

Rigorously  speaking,  we  have  Ntop,r{o)  <  they  are  equivalent 

only  when  the  IUfc,n’s  are  non- arithmetic,  since  they  can  be  different  when 
more  than  one  of  lUfc^n’s  is  equal  to  Wr,n-  Fortunately,  both  Ntop,r{a)  and 


^top,r  (®)  possess  similar  asymptotic  optimality  properties  and  either  can  be 
used  in  our  context. 

3.3.  The  Combined  Thresholding  Schemes.  For  the  purpose  of  applica¬ 
tions  in  censoring  sensor  networks  in  Figure  1,  one  may  combine  the  above- 
mentioned  two  thresholding  methods  together:  use  the  hard  thresholding 
at  the  local  sensor  level,  and  then  adopt  the  top-r  thresholding  rule  at 
the  fusion  center  level  to  detect  the  event  when  one  has  a  prior  knowl¬ 
edge  that  the  occurring  event  affects  at  most  r  sensors.  Specifically,  at 
time  n,  the  sensor  message  sent  by  the  /c-th  local  sensor  is  defined  by 
Uk,n  =  Wk,nI{Wk,n  >  Pkb}-  The  fusion  center  then  orders  all  sensor  mes¬ 
sages  Uk,nS  as  >  •  •  •  >  raises  an  alarm  if  the  sum  of 

the  r  largest  Uk^nS  is  too  large.  Mathematically,  the  combined  thresholding 
scheme  raises  an  alarm  at  the  global  level  at  time 

r 

(3.5)  Ncomb,r{0‘7b)  =  first  n  >  1  such  that  E  bJ(k),n  — 

k=l 

Alternatively,  let  Ur^n  be  the  r-th  largest  value  among  the  K  sensor  messages 
Uk^n’Sj  and  then  another  version  of  the  combined  thresholding  scheme  can 
be  defined  by  the  stopping  time 

K 

KombA^7  b)  =  inf  |n  >  1  :  ^  Uk,nI{Uk,n  >  Ur,n]  >  a} 

k=l 

K 

=  inf  |n  >  1  :  ^  [Wk^nH^k^n  >  Pkb}I{Wk,n  >  Ur,n}]  >  a}- 
k=l 

Again,  Ncomb,r{cL,  b)  <  Nlombri^-:  but  they  are  equivalent  in  non-arithmetic 
cases  and  are  generally  asymptotically  equivalent  otherwise.  Evidently,  the 
family  of  the  combined  thresholding  schemes  contains  two  censoring  param¬ 
eters,  h  and  r,  and  it  includes  the  families  of  both  hard-thresholding  and 
top-r  thresholding  schemes. 


3.4.  Choice  Of  Thresholding  Parameters.  Compared  to  the  existing  “MAX” 
or  “SUM”  schemes,  our  proposed  thresholding  schemes  include  two  new 
thresholding  parameters:  r  for  the  top-r  thresholding  schemes  and  b  for  the 
hard-thresholding  schemes  (and  both  r  and  h  for  the  combined  thresholding 
schemes).  It  is  natural  to  ask  how  to  choose  these  two  thresholding  param¬ 
eters  in  practice? 

The  choice  of  thresholding  parameter  r  is  straightforward  and  depends 
on  whether  one  has  any  prior  knowledge  about  the  maximum  number  of 
affected  data  streams.  If  such  a  knowledge  exists  and  it  is  believed  that  at 
most  ro  data  streams  will  be  affected  by  the  occuring  event,  then  one  should 
use  this  ro  as  the  value  of  thresholding  parameter  r.  Otherwise  one  may 
want  to  be  conservative  to  choose  r  =  K,  e.g.,  consider  the  “SUM”  scheme 
or  the  hard-thresholding  scheme  Nhard{a,b)  in  (3.4). 

The  choice  of  thresholding  parameter  b  is  nontrivial,  and  may  need  to 
consider  some  non-statistical  constraints.  As  an  illustration,  in  certain  ap¬ 
plications  of  censoring  sensor  networks,  the  censoring  parameter  b  may  be 
chosen  to  satisfy  the  constraints  on  the  average  fraction  of  transmitting  sen¬ 
sors  when  no  events  occur.  For  our  proposed  scheme  Nhardia,b),  when  no 
event  occurs,  the  average  fraction  of  transmitting  sensors  at  any  time  step 
n  is 

1  ^  1  ^ 

-^p(°°)(C/fc,„/NULL)  =  -J2P^°^\Wk,n>Pkb) 

1  ^ 

^  k=i 

where  the  last  inequality  follows  from  the  well-known  properties  of  the  local 
CUSUM  statistics  that  >  a)  <  exp(— a)  for  all  a  >  0,  see,  for 

example.  Appendix  2  on  Page  245  of  Siegmund  [26].  In  particular,  if  all  K 
sensors  are  homogeneous  in  the  sense  that  the  I^gkjfkYs  are  the  same  for 


all  k,  then  =  1/K,  and  the  average  fraction  of  transmitting  sensors  at 
any  time  step  is  exp(— when  no  event  occurs.  Hence  for  our  proposed 
scheme  Nhardid,  b),  a  choice  of 

b  =  K  log  ri~^, 

or  equivalently,  the  local  hard  threshold  bk  =  Pkb  =  b/K  =  logr/”^,  will 
guarantee  that  on  average,  at  most  100?]%  of  K  homogeneous  sensors  will 
transmit  messages  at  any  given  time  when  no  event  occurs.  It  is  worth 
emphasizing  that  here  the  thresholding  parameter  6  is  a  function  of  K  and 
the  local  threshold  b^  =  log  7?“^  is  a  constant  that  does  not  depend  on  K. 

The  choice  of  b  becomes  more  complicated  for  the  combined  thresholding 
schemes  Ncomb,r{cL,b)  (or  N*^f^^{a,b))  if  the  thresholding  parameter  r  has 
been  given  beforehand.  We  do  not  have  an  explicit  answer,  and  a  general 
rule  of  thumb  that  the  censoring  parameter  b  in  (3.5)  shall  not  be  too  large, 
as  one  generally  should  keep  at  least  r  non-zero  Uk^nS  when  r  data  streams 
are  affected  by  the  event. 

4.  Numerical  Simulations.  Before  we  offer  asymptotic  optimality  the¬ 
ory,  let  us  present  a  numerical  simulation  study  in  this  section  to  illustrate 
the  usefulness  of  the  proposed  schemes.  Suppose  that  there  are  K  =  100 
independent  and  identical  sensors  in  a  system  illustrated  in  Figure  1,  and 
each  local  sensor  observes  a  one-dimensional  data  stream.  Assume  that  the 
observations  at  each  local  sensor  are  i.i.d.  with  mean  0  and  variance  1  before 
the  change  and  with  mean  0.5  and  variance  1  after  the  possible  change.  In 
our  simulation  study,  we  simply  assume  that  the  change  is  instantaneous  if 
a  sensor  is  affected,  i.e.,  the  local  change-point  =  i/  or  oo,  and  we  do  not 
know  which  subset  of  sensors  will  be  affected  by  the  occurring  event. 


For  the  purpose  of  comparison,  we  conduct  numerical  simulations  for  five 
families  of  detection  schemes: 

•  the  “MAX”  scheme  rinax(c)  in  (2.4), 

•  the  “SUM”  scheme  (d)  in  (2.5), 

•  the  top-r  thresholding  scheme  Ntop,r{o-)  in  (3.1)  with  r  =  10, 

•  the  hard  thresholding  scheme  Nhard{cL,h)  in  (3.4), 

•  the  combined  thresholding  schemes  Ncomb,r{<d,b)  in  (3.5)  with  r  =  10. 

For  each  family  of  schemes  Nfiard{cL,  b)  and  iVcomfe, r=io(®)  b),  we  further  con¬ 
sider  three  specific  schemes,  depending  on  the  value  of  the  hard-thresholding 
parameter  b:  (i)  b  =  iF/2  ps  —  log(0.607)  *  K,  (ii)  b  =  —  log(O.l)  *  K  = 
2.3026Ar  and  (hi)  b  =  —  log(O.Ol)  *  K  =  4.6052A'.  In  the  context  of  censor¬ 
ing  sensor  networks,  the  choices  of  these  values  will  guarantee  that  when 
no  event  occurs,  on  average  at  most  r]  =  60.7%,  10%,  and  1%  of  A'  =  100 
homogeneous  sensors  will  transmit  messages  at  any  given  time,  respectively. 
Hence,  in  our  simulation  study,  there  are  a  total  of  nine  specific  monitoring 
schemes. 

In  order  to  fairly  compare  these  nine  specific  schemes  T{a),  we  applied 
them  to  the  same  computer-generated  pseudo-random  data  sets  in  our  sim¬ 
ulation  studies.  Specifically,  we  first  simulate  a  single  large  random  matrix 
with  dimension  m  x  nmax  x  K  from  the  pre-change  distributions,  where  m 
denotes  the  desired  number  of  repetitions  in  Monte  Carlo  simulations,  nmax 
is  the  largest  time  step  (or  the  largest  run  length  to  raise  an  alarm),  and 
K  is  the  number  of  data  streams  (or  sensors).  In  our  numerical  simulations, 
m  =  1000,  nmax  =  2*10^  and  K  =  100.  Then,  for  each  of  these  nine  specific 
schemes  T(a),  the  simulated  data  set  was  used  to  determine  the  appropriate 
values  of  the  detection  threshold  a  to  satisfy  Eoo(T'(a))  Ri  7  (within  the 


Table  1 


Detection  delays  with  K  =  100  identical  data  streams:  20  or  more  streams  are  affected 


Detection 

ff  affected  data  streams 

7 

Scheme 

100 

80 

50 

30 

20 

Tum(d=  101.66) 

5.55  ±0.02 

6.53  ±0.02 

9.14  ±0.04 

13.0  ±0.1 

17.3  ±0.1 

Nhard{a  =  96.79,  b  =  50) 

5.58  ±0.02 

6.54  ±0.02 

9.16  ±0.04 

13.0  ±0.1 

17.4  ±0.1 

Nhardia  =  52.28,  b  =  230.26) 

7.72  ±  0.02 

8.49  ±0.03 

10.47  ±0.04 

13.6  ±0.1 

17.1  ±0.1 

Nhardla  =  21.90,  b  =  460.52) 

12.57  ±  0.04 

13.30  ±0.05 

14.91  ±0.06 

17.1  ±0.1 

19.5  ±0.1 

Tnax(c  =  8.77) 

22.46  ±  0.11 

23.23  ±0.11 

24.64  ±0.13 

26.9  ±0.2 

28.7  ±0.2 

10® 

Schemes  Ncom.b,r{a,b)  in  (3.5) 

Ntop,r^io{a  =  41.67) 

10.82  ±0.03 

11.48  ±0.04 

13.07  ±0.05 

15.3  ±0.1 

17.9  ±0.1 

Ncomb,r=io{a  =  41.67,6  =  50) 

10.82  ±0.03 

11.48  ±0.04 

13.07  ±0.05 

15.3  ±0.1 

17.9  ±0.1 

N^omb,r^io{a  =  41.49,6  =  230.26) 

10.73  ±0.03 

11.41  ±0.04 

12.98  ±0.05 

15.2  ±0.1 

17.8  ±0.1 

Ncomb,r=io{a  —  21.90,6  =  460.52) 

12.57  ±  0.04 

13.30  ±0.05 

14.91  ±0.06 

17.1  ±0.1 

19.5  ±0.1 

Tum(d=  111.04) 

6.16  ±0.02 

7.29  ±0.02 

10.25  ±0.04 

14.9  ±0.1 

20.1  ±0.1 

Nwd(a  =  106.38,6  =  50) 

6.17  ±0.02 

7.29  ±0.02 

10.27  ±0.04 

14.9  ±0.1 

20.2  ±0.1 

Nhard{a  =  62.26,  6  =  230.26) 

8.34  ±0.03 

9.22  ±0.03 

11.53  ±0.04 

15.3  ±0.1 

19.7  ±0.1 

Nhardla  =  29.70,  6  =  460.52) 

13.39  ±0.04 

14.17  ±0.05 

16.14  ±0.06 

19.0  ±0.1 

21.9  ±0.1 

Taax(c=  11.12) 

31.79  ±  0.14 

32.74  ±0.15 

34.81  ±0.17 

37.6  ±0.2 

39.9  ±0.2 

10® 

Schemes  Ncomb,r{a,b)  in  (3.5) 

Nfop,T^w{a  =  46.55) 

12.67  ±  0.04 

13.41  ±0.04 

15.22  ±0.05 

17.8  ±0.1 

20.8  ±0.1 

Nco-nib,r=io{a  =  46.55,6  =  50) 

12.67  ±  0.04 

13.41  ±0.04 

15.22  ±0.05 

17.8  ±0.1 

20.8  ±0.1 

Naomb,r^io{a  —  46.53,6  =  230.26) 

12.67  ±  0.04 

13.41  ±0.04 

15.21  ±0.05 

17.8  ±0.1 

20.8  ±0.1 

Naomb,r=io{a  =  29.70, 6  =  460.52) 

13.39  ±0.04 

14.17  ±0.04 

16.14  ±0.06 

19.0  ±0.1 

21.9  ±0.2 

range  of  sampling  error  based  on  m  repetition  Monte  Carlo  simulations). 
Next,  using  the  obtained  threshold  value  a,  we  simulated  the  detection  de¬ 
lay  when  the  change-point  occurs  at  time  i/  =  1  under  several  different 
post-change  scenarios  (i.e.,  different  number  of  affected  data  streams)  by 
adding  the  post-change  mean  fii  =  0.5  appropriately  to  the  simulated  data 
matrix.  This  will  provide  the  estimated  values  of  the  worst  case  detection 
delays  because,  for  each  of  these  nine  schemes  we  considered,  the  worst  case 
detection  delay  occurs  when  the  change-point  ly  =  1. 

In  our  simulations  we  consider  several  post-change  scenarios,  depending 
on  how  many  data  streams  are  affected.  To  better  summarize  our  simula¬ 
tions,  we  have  divided  our  results  into  two  tables;  Table  1  for  the  cases  when 
20  or  more  data  streams  are  affected  and  Table  2  for  the  cases  when  10  or 
less  data  streams  are  affected.  In  each  table,  two  different  values  of  the  false 


Table  2 


Detection  delays  with  K  =  100  identical  data  streams:  10  or  less  streams  are  affected 


Detection 

ff  affected  data  streams 

7 

Scheme 

10 

8 

5 

3 

1 

Tum(d=  101.66) 

27.6  ±0.2 

32.5  ±0.2 

44.1  ±0.4 

61.3  ±0.6 

127.0  ±  1.7 

Xwd(a  =  96.79,6  =  50) 

27.9  ±0.2 

32.7  ±0.2 

44.8  ±0.4 

62.1  ±  0.6 

128.8  ±  1.7 

Nhardia  =  52.28,  b  =  230.26) 

26.5  ±0.2 

30.6  ±0.2 

41.7±0.3 

59.4  ±0.6 

124.6  ±  1.6 

Nhardla  =  21.90,  b  =  460.52) 

25.5  ±0.2 

28.2  ±0.2 

35.4  ±0.3 

46.9  ±  0.4 

99.1  ±  1.2 

Tiiaxlc  =  8.77) 

33.0  ±0.2 

34.5  ±0.3 

38.6  ±0.3 

44.4  ±  0.4 

65.8  ±0.9 

10® 

Schemes  Ncom.b,r{a,b)  in  (3.5) 

Ntop,r^io{a  =  41.67) 

24.3  ±0.2 

27.1  ±0.2 

34.6  ±0.3 

45.9  ±  0.4 

89.3  ±  1.1 

Ncomb,r=io{a  =  41.67,6  =  50) 

24.3  ±0.2 

27.1  ±0.2 

34.6  ±0.3 

45.9  ±  0.4 

89.3  ±  1.1 

N^omb,r^io{a  =  41.49,6  =  230.26) 

24.3  ±0.2 

27.2  ±0.2 

35.1  ±0.3 

47.1  ±  0.4 

92.0  ±  1.1 

Ncomb,r=io{a  —  21.90,6  =  460.52) 

25.5  ±0.2 

28.2  ±0.2 

35.4  ±0.3 

46.9  ±  0.4 

99.1  ±  1.2 

Tum(d=  111.04) 

33.4  ±0.2 

39.3  ±0.3 

55.2  ±0.4 

80.2  ±0.7 

191.6  ±2.1 

Xwd(a  =  106.38,6  =  50) 

33.8  ±0.2 

39.8  ±0.3 

56.1  ±0.5 

81.2  ±0.7 

195.5  ±2.1 

Nhard{a  =  62.26,  6  =  230.26) 

31.9  ±0.2 

37.7  ±0.2 

53.7  ±0.4 

78.4  ±0.7 

191.6  ±2.1 

Nhardla  =  29.70,  6  =  460.52) 

29.9  ±0.2 

33.5  ±0.2 

43.3  ±0.3 

61.3  ±0.5 

152.6  ±  1.7 

Taax(c=  11.12) 

45.2  ±0.3 

47.2  ±0.3 

52.3  ±0.4 

59.2  ±0.5 

85.5  ±  1.0 

10® 

Schemes  Ncomb,r{a,b)  in  (3.5) 

Ntop,r=io{a  =  46.55) 

28.6  ±0.2 

32.2  ±0.2 

41.8  ±0.3 

57.1  ±0.5 

124.2  ±  1.4 

Naomb,T^w{a  =  46.55,6  =  50) 

28.6  ±0.2 

32.2  ±0.2 

41.8  ±0.3 

57.1  ±0.5 

124.2  ±  1.4 

Naomb,r^w{a  =  46.53,6  =  230.26) 

28.6  ±0.2 

32.3  ±0.2 

42.3  ±0.3 

58.3  ±0.5 

128.0  ±  1.4 

Naomb,r^w{a  =  29.70, 6  =  460.52) 

29.9  ±0.2 

33.5  ±0.2 

43.4  ±0.3 

61.3  ±0.5 

152.6  ±  1.8 

alarm  constraint  7  in  (2.1)  are  considered;  7  =  10^  and  10^,  and  the  detection 
delay  is  recorded  as  the  Monte  carlo  estimate  ±  standard  error.  Moreover, 
for  each  given  false  alarm  constraint  7,  we  further  divided  our  results  into 
two  sub-scenarios:  one  for  the  family  of  schemes  Nhard{ct,  b)  and  one  for  the 
family  of  schemes  A^comb,r=io(0)  &)•  Recall  that  under  our  numerical  simula¬ 
tions  setting,  the  “MAX”  scheme  Tmax(c)  can  be  thought  of  Nhard{a,  b)  with 
b  =  Kc  =  100c  and  a  =  c,  while  the  “SUM”  scheme  Tsum(d)  can  be  thought 
of  Nyard{(ti  b)  with  6  =  0  and  a  =  d.  Likewise,  the  top-r  thresholding  scheme 
Ntop,r=io{a)  in  (3.1)  can  be  thought  of  A^comfe,r=io(0)  &)  with  6  =  0.  Thus  in 
each  simulation  scenario,  we  report  the  numerical  results  of  these  specific 
schemes  in  order  of  increasing  values  of  the  censoring  parameter  6. 

From  Tables  1  and  2,  among  these  nine  specific  schemes,  when  a  small 
number  (1  ~  3)  of  100  homogeneous  data  streams  are  affected  by  the  event. 


the  “MAX”  scheme  Tmav(c)  is  the  best  (in  the  sense  of  smallest  detec¬ 
tion  delay),  the  “SUM”  scheme  Tsum(d)  is  the  worst,  and  all  other  schemes 
^hard{a,b)’s  or  Ncomb,r{a,  b)'s  are  in-between.  Similarly,  when  a  large  num¬ 
ber  (20  or  more)  of  100  homogeneous  data  streams  are  affected,  the  order 
is  reserved:  Tsi.m  (d)  is  the  best,  is  the  worst,  and  all  other  schemes 

are  in-between.  However,  when  5  ~  10  data  streams  are  affected,  the  best 
schemes  are  the  family  of  schemes  Ncomb,r=io{a',b),  which  is  designed  to  de¬ 
tect  the  scenario  when  10  data  streams  are  affected  by  the  event.  In  addition, 
for  each  given  scheme,  the  fewer  affected  data  streams  we  have,  the  larger 
detection  delay  it  will  have.  Similarly,  the  larger  value  the  false  alarm  con¬ 
straint  7  in  (2.1),  the  larger  detection  delay  it  will  have.  All  these  results 
are  consistent  with  our  intuition. 

It  worths  mentioning  that  for  the  family  of  the  hard-thresholding  schemes 
Nhard{0‘-,b)  in  (3.4),  a  larger  censoring  threshold  value  b  actually  leads  to  a 
smaller  detection  delay  when  only  a  few  (between  1  and  5)  of  data  streams 
are  affected  (recall  that  Tma.x{c)  can  be  thought  of  Nhardia,  b)  with  the  largest 
censoring  parameter  b  =  1 00c).  This  is  consistent  with  our  earlier  discussion 
that  a  larger  censoring  threshold  value  b  in  Nhard{a,b)  may  actually  be 
necessary  for  efficient  detection  when  the  affected  data  streams  are  sparse. 

A  surprising  and  possibly  counter-intuitive  result  in  Tables  1  and  2  is  the 
effect  of  not  so  large  values  of  hard-thresholding  parameter  b  in  finite  sample 
simulations.  For  example,  the  performances  of  the  “SUM”  scheme  Tsnra{d) 
and  the  hard  thresholding  scheme  Nhardia,b  =  50)  are  similar  in  view  of 
sampling  errors.  Likewise,  the  top-r  thresholding  scheme  Ntop,r=io{cL)  and 
the  combined  thresholding  scheme  Ncomb,r=io{<i-,b  =  50)  also  have  identi¬ 
cal  performances.  In  other  words,  for  the  family  of  schemes  Nhard{(^,b)  or 
Nf.omb,r{0'jb),  the  schemes  with  6  =  0  or  6  =  50  have  similar  performances. 


even  though  6  =  50  implies  that  the  scheme  only  requires  exp(— 6/X)  = 
exp(— 0.5)  =  60.7%  of  100  sensors  to  transmit  information  to  the  fusion 
center  at  any  given  time  when  no  event  occurs. 

It  is  also  interesting  to  see  the  effect  of  the  top-r  thresholding  parame¬ 
ter  r  in  finite  sample  simulations  when  the  hard-thresholding  parameter  h 
is  large.  From  Tables  1  and  2,  when  the  false  alarm  constraint  7  in  (2.1) 
is  only  moderately  large,  the  performances  of  Nhard{cL,  b  =  460.52)  and 
A^comb,r=io(0)  ^  =  460.52)  are  identical,  and  they  actually  also  have  the  same 
detection  threshold  a.  Intuitively,  the  stopping  time  Ncomb,r{a,  b)  is  decreas¬ 
ing  as  a  function  of  r,  and  thus  we  have  Nhardici,  b)  <  Ncomb,r=ioia,  b)  when 
b  =  460.52.  So  one  may  wonder  why  our  numerical  simulations  lead  to  iden¬ 
tical  results?  One  explanation  is  that  with  such  a  choice  of  6  =  460.52,  when 
no  event  occurs,  on  average  there  is  at  most  1  non-zero  sensor  messages 
at  any  given  time,  and  thus  there  is  little  difference  whether  one  uses  the 
sum  of  the  largest  r  =  10  sensor  messages  or  uses  the  sum  of  all  K  =  100 
sensor  messages.  Hence  similar  performances  are  observed  in  finite-sample 
simulations. 

In  summary,  from  the  performance  viewpoint,  using  one  of  hard-thresholding 
and  top-r  thresholding  approaches  may  be  sufficient  in  certain  applications, 
since  the  performance  of  the  combined  censoring  scheme  Ncomb,riO'j  b)  can  be 
similar  to  that  of  either  the  hard-thresholding  scheme  Ni^^ardiO':  b)  or  the  top 
r-thresholding  scheme  Ntop,r{o)-,  especially  when  the  false  alarm  constraint 
7  in  (2.1)  is  only  moderately  large. 

5.  Asymptotic  Optimality  Theory.  In  this  section  we  establish  asymp¬ 
totic  optimality  properties  of  our  proposed  thresholding  schemes.  To  simplify 
our  arguments,  denote  by  <5^  =  —  z/  >  0  the  delay  effect  on  the  /c-th  data 


stream,  and  let  5max  =  5^  be  the  maximum  delay  among  all  finite 

delays.  Recall  that  we  assume  min^  Vj^  =  v  and  thus  min^.  6k  =  0.  Therefore, 
if  we  denote  by  A  the  set  of  all  possible  post-change  hypotheses  on  the  delay 
effects  6k  s  and/or  the  subset  of  affected  data  streams,  then  the  set  A  can 
be  written  as 


A  =  {(5i, . . .  ,6k)  '■  the  6k  s  either  =  oo 

(5.1)  or  satisfy  0  <  <5^  <  (^max  and  min  =  0}. 

l<k<K 

Moreover,  we  need  to  provide  a  more  rigorous  definition  of  detection  delay 
than  that  in  (2.2)  to  reflect  the  delay  effects  6kS.  Denote  by  and 

Sk  probability  measure  and  expectation  when  the  event  occurs 
at  time  v  and  the  density  of  observations  Xk^nS  at  the  /c-the  data  stream 
changes  from  fk  to  gk  at  time  Vk  =  ^  for  all  A:  =  1, . . . ,  A'.  As  in  the 
definition  D(T)  in  (2.2),  we  can  define  the  detection  delay  than  as 


(5.2) 


E 


<5i,...,(5x 


(T)  = 


sup 

l<i^<oo 


esssupE^'^'J  ,5^ 


T-u  +  iy 


X,,-i 


Here  we  use  E^^  ,  ,5^(T)  to  emphasize  that  the  detection  delays  may  depend 
on  the  delay  effects  6k  s.  In  our  asymptotic  theory,  we  assume  that  the  delay 
effects  set  A  in  (5.1)  and  the  unknown  change-point  v  are  separated,  i.e.,  the 
set  A  does  not  depend  on  the  unknown  change-point  v.  Under  our  setting, 
detecting  the  unknown  change-point  v  is  of  primary  interest,  and  the  delay 
effects  6k  s  are  nuisance  parameters  that  belong  to  some  pre-specified  set  A 
depending  on  our  prior  knowledge  of  the  event. 

Let  us  present  asymptotic  theory  when  A  is  defined  in  (5.1).  The  following 
theorem,  whose  proof  is  postponed  in  the  appendix,  derives  the  information 
bound  on  the  detection  delays  of  any  globally  monitoring  schemes,  as  the 
false  alarm  constraint  7  in  (2.1)  goes  to  00. 


Theorem  5.1.  Assume  a  scheme  T{'^)  satisfies  the  false  alarm  con¬ 
straint  (2.1).  Then  for  any  given  post-change  hypothesis  ((5i, . . .  ,Sk)  £  A, 
as  7  goes  to  oo, 

. 

where 

K 

(5.4)  J{5i,...,5k)  =  ^/(c/fc,/fc)/{4  <  oo}, 

k=l 

and  I{gk,fk)  is  the  Kullback-Leibler  information  number  defined  in  (3.3), 
and  I{A}  is  the  indicator  funetion  of  set  A. 


Next,  we  establish  the  asymptotic  properties  of  the  top-r  thresholding 
scheme  Ntop,r{o)  in  (3.1),  the  hard  thresholding  scheme  Nfiard{cL,b)  in  (3.4), 
and  the  combined  thresholding  scheme  Ncomb,r{o-,b)  in  (3.5)  when  the  de¬ 
tection  threshold  a  goes  to  oo,  regardless  of  the  false  alarm  constraint  (2.1). 
The  proof  of  the  following  theorem  is  very  technical  and  is  presented  in 
detail  in  the  appendix. 


Theorem  5.2.  Let  Na^  be  the  proposed  thresholding  schemes  Ntop,r{a), 
d^hard{cL,  b)  or  Ncomb,r{cL,  b).  As  a  ^  OO,  let  h'  =  b' (o)  be  a  eonstant  such  that 
both  b'  and  a  —  b'  go  to  oo. 


(i)  The  proposed  scheme  Nafi  satisfies 


(5.5) 


1  +  Q-  +  ^  +  •  •  •  + 


for  all  b  >  0. 

(ii)  For  any  combination  {5i, ...  ,5k)  £  A  defined  in  (5.1),  and  for  all 
0  <  b  <  b' ,  we  have 


a 

J{5i, . . .  ,5k) 


(5.6) 


+  0{Vb')  +  5 

max 


where  J((5i, . . . ,  5k)  is  defined  in  (5.4),  and  the  relation  holds  for  the 
top-r  thresholding  seheme  Ntop,r{a)  in  (3.1)  and  the  combined  thresh¬ 
olding  scheme  Ncomh,r{a,  b)  in  (3.5)  with  b  >0  whenever  J2k=i  < 
oo}  <  r,  i.e.,  when  the  occurring  event  affects  at  most  r  data  streams. 

Finally,  we  are  now  in  a  position  to  establish  the  asymptotic  optimal¬ 
ity  properties  of  our  proposed  thresholding  schemes  Na^b  in  the  sequen¬ 
tial  change-point  detection  problems  with  the  false  alarm  constraint  7  in 

(2.1) ,  as  7  goes  to  00.  Here  we  will  make  an  additional  assumption  that 
5max  =  o(log7)  as  7  goes  to  00,  and  such  assumption  can  be  easily  satished 
when  all  hnite  S^s  are  uniformly  bounded  by  some  constant  that  does  not 
depend  on  the  false  alarm  constraint  7. 

Corollary  5.1.  For  a  given  K  and  for  any  6  >  0,  with  the  choice  of 

(5.7)  a  =  a^  =  log'j  -\-  (K  -  1  -\-  o(l))  log  log  7, 

the  proposed  threholding  scheme  Na^  satisfies  the  false  alarm  constraint 

(2.1) .  Moreover,  if  we  assume  thatS^ax  =  o(log7),  then  the  hard-thresholding 
schemes  Nhard{a,b)  in  (3.4)  asymptotically  minimize  F,Sj^^,„^5j^{Nhard{a,b)) 
(up  to  the  first-order)  for  each  and  every  post-change  hypothesis  ((5i, . . . ,  5k)  £ 
A  subject  to  the  false  alarm  constraint  (2.1),  as  j  in  (2.1)  goes  to  00.  The 
conclusion  also  hold  if  Nhard{a,  b)  is  replaced  by  either  the  top-r  thresholding 
scheme  Ntop,r  in  (3.1)  or  the  combined  thresholding  scheme  Ncomb,r{a,b)  in 
(3.5)  when  the  occurring  event  affects  at  most  r  data  streams,  i.e.,  when 
((5i, . . . ,  5k)  £  A  satisfies  J2k=i  ^{^k  <  00}  <  r. 


Proof:  This  corollary  follow  at  once  from  Theorems  5.1  and  5.2.  In 


particular,  the  choice  of  in  (5.7)  follows  from  (5.5)  and  the  fact  that 
1  +  a  +  ^  +  •  •  •  +  (^_iy  ~  {K-iy.  ^  fixed  and  a  goes  to  oo.  □ 

It  is  useful  to  add  several  remarks.  First,  it  is  worth  emphasize  the  special 
case  (5max  =  0  of  Corollary  5.1.  While  existing  statistical  methods  can  pro¬ 
vide  asymptotically  optimal  solutions  for  this  scenario,  they  are  generally 
infeasible  in  practice.  For  instance,  consider  a  case  that  the  occurring  event 
is  known  to  affect  at  most  r  data  streams  instantaneously  but  we  do  not 
know  the  actual  subset  of  affected  data  streams.  To  develop  asymptotically 
optimal  schemes,  the  classical  generalized  likelihood  ratio  method  would  re¬ 
quire  us  to  search  all  possible  (^)  (^)  combinations  of  affected  data 

streams,  which  can  be  too  huge  from  the  computational  viewpoint.  On  the 
other  hand,  the  proposed  top-r  thresholding  scheme  Ntop,r  in  (3.1)  is  not 
only  asymptotically  optimal  for  this  case  but  also  computationally  feasible 
to  be  implemented  for  large  K  over  long  time  period. 

One  may  worry  about  the  term  dmax  in  (5-6)  at  first  glance.  However, 
we  want  to  emphasize  that  this  is  just  some  technical  details  to  simplify 
our  arguments.  Otherwise  we  can  focus  on  a  subset  of  {!,...,  iF}  :  K,  = 
{k  ■.  5k  =  0(log7)},  and  pretend  that  we  are  just  monitoring  those  data 
streams  inside  the  subset  K,  without  worrying  those  outside  of  1C.  This  is 
because  subject  to  the  false  alarm  constraint  7  in  (2.1),  the  detection  delays 
are  typically  in  the  order  of  logy,  and  the  data  streams  inside  /C  will  raise 
an  alarm  at  time  v  +  ©(logy)  before  the  event  affects  those  data  streams 
outside  of  K,.  In  other  words,  the  term  dm  ax  in  (5.6)  can  be  easily  replaced  by 
0(log7)  (although  the  term  J{5i, . . .  ,5k)  shall  also  be  defined  accordingly, 
i.e.,  only  consider  those  5k  inside  JC).  In  addition,  in  our  corollary  we  make  a 
stronger  assumption  dmax  =  o(log7)  instead  of  ©(logy)  so  that  we  can  take 


advantage  of  the  well-known  asymptotic  lower  bounds  on  detection  delays 
subject  to  the  false  alarm  constraint  7  in  (2.1).  However,  this  is  not  essential 
either,  and  we  conjecture  that  the  proposed  thresholding  schemes  hold  their 
respective  asymptotic  optimality  properties  regardless  of  the  value  of  (Jmax- 

It  is  also  interesting  to  see  what  happens  if  both  K  and  7  go  to  00.  We  do 
not  have  rigorous  mathematical  arguments,  but  our  theorems  seem  to  lead 
to  the  following  heuristic  arguments.  Note  that  the  right-hand  side  of  (5.5) 
can  be  written  as  1/P{Uk  >  a),  where  Uk  is  the  sum  of  K  independent 
exponential  random  variables  with  mean  1.  By  the  theory  of  large  deviations, 
for  any  constant  6  >  1,  we  have 

lim  — logF (Uk  >  Kb)  =  b  —  1  —  log(6), 

K—^00  K 

see,  for  example,  Durrett  [5,  Ch.  1.9].  Assume  logy  =  K{b—  1  —  log(6)),  then 
we  can  choose  the  threshold  a  =  Kb  to  satisfy  the  false  alarm  constraint  (2.1) 
if  we  assume  that  the  lower  bound  (5.5)  still  holds  when  K  goes  to  00.  By 
Theorem  5.2,  the  asymptotic  efficiency  of  our  proposed  thresholding  schemes 
as  compared  to  the  lower  bound  in  Theorem  5.1  will  be  6/(6  —  1  —  log  6), 
which  goes  to  1  if  6  goes  to  00,  or  equivalently,  if  6— 1— log(6)  =  logy/AT  goes 
to  00.  In  other  words,  the  asymptotic  optimality  properties  of  our  proposed 
thresholding  schemes  seem  to  still  hold  when  the  number  K  of  data  streams 
and  the  false  alarm  constraint  7  go  to  00  in  such  a  way  that  log  7/ AT  also 
goes  to  00.  However,  in  other  scenarios,  our  proposed  schemes  no  longer 
achieve  the  asymptotic  lower  bound  in  Theorem  5.1,  which  may  or  may  not 
provide  a  sharp  lower  bound  on  the  detection  delays  when  K  goes  to  00. 

6.  Discussion.  From  the  methodology  viewpoint,  our  proposed  ap¬ 
proaches  can  be  easily  extended  to  other  reasonable  local  detection  statis¬ 
tics  or  other  more  complicated  models,  as  long  as  the  observations  are  in- 


dependent  among  different  data  streams.  As  an  example,  besides  the  lo¬ 
cal  CUSUM  statistics  another  widely  used  local  detection  statistics 

is  the  quasi-Bayesian-type  statistic  of  Shiryaev  [25]  and  Roberts  [23].  The 
Shiryaev-Roberts  statistic  is  defined  by 


Rk 


,n  — 


^yrgki^ 

u=li=u  fk{Xk,i) 


and  has  a  recursive  formula:  Rk,n  =  (1  +  Rfc,n-i)  ^ith  Rkfl  =  0. 

If  we  compare  e'^"  with  it  is  easy  to  see  that  the  maximum  over 

the  change-point  v  in  the  CUSUM  statistic  is  replaced  by  the  sum  in  the 
Shiryaev-Roberts  statistic  Rn,  and  it  is  well-known  that  Page’s  CUSUM  and 
Shiryaev-Roberts  procedures,  based  on  Wk,n  and  log  Rk,n  respectively,  have 
similar  performance  when  monitoring  a  single  local  one-dimensional  data 
stream,  see,  Poliak  and  Siemgund  [21].  Hence,  it  is  natural  to  ask  whether 
the  local  CUSUM  statistics  Wn  in  our  proposed  schemes  can  be  replaced  by 
logRfc^jj,  the  Shiryaev-Roberts  statistic  in  the  logarithm  scale.  Our  prelim¬ 
inary  numerical  simulation  study  seems  to  suggest  that  the  answer  is  still 
a  “Yes”  when  monitoring  a  large  number  of  data  streams,  as  the  Shiryev- 
Roberts  like  version  also  has  similar  performance  as  the  CUSUM  version  in 
our  simulation  study  in  Section  4. 

Likewise,  our  approaches  thresholding  schemes  can  also  be  easily  adjusted 
for  more  complicated  models,  e.g.,  when  the  post-change  distribution  involve 
unknown  parameters.  Extensive  research  has  been  done  in  the  past  decades 
to  deal  with  more  complicated  models  for  one  dimensional  data  stream,  see, 
Poliak  and  Siemgund  [21],  Siegmund  and  Venktraman  [27],  Lai  [9],  Lorden 
and  Poliak  [12].  As  an  illustration,  assume  that  the  observations  at  the  k- 
th  data  stream  have  a  fk  distribution  before  the  change,  but  may  have  a 
gkfik  distribution  after  the  change  if  the  k-th  data  stream  is  affected  by  the 


occurring  event,  where  the  value  9^  is  unknown  and  may  differ  for  different 
k.  In  this  case,  several  theoretically  efficient  and  computationally  simple 
local  schemes  have  been  proposed,  and  most  have  the  form  that  raises  an 
alarm  if  a  local  detection  statistic  at  time  n,  say,  Wj^^,  exceeds  some  pre¬ 
specified  constant.  To  be  more  specific,  let  us  consider  the  schemes  proposed 
in  Lorden  and  Poliak  [12],  in  which  one  first  find  an  estimator  9k,v,n-i  of  9k 
by  the  maximum  likelihood  or  method  of  moments  methods  based  on  the 
observations  from  Xk^u^  • ' '  for  each  given  1  <  v  <  n  —  1  and  then 

use  the  estimator  9k,u,n-i  to  construct  a  local  detection  statistic  that 
mimics  the  local  CUSUM  or  Shiryaev- Roberts  statistics,  e.g., 

=  max  {0, ,  glog 


or 


iT,%=iog^n 


fkiXk^i) 


v=l  i=v 

Then  our  proposed  top-r  or  hard  thresholding  rules  or  both  can  be  applied 
to  these  local  detection  statistic  too,  thereby  providing  scalable  global 
schemes  in  the  case  when  unknown  parameters  are  present  in  the  post¬ 
change  distributions. 

When  the  local  detection  statistic  is  the  Shiryaev-Roberts  statistics  or 
other  complicated  local  detection  statistic,  we  conjecture  that  the  proposed 
thresholding  schemes  still  hold  certain  asymptotic  optimality  properties.  Un¬ 
fortunately,  we  have  so  far  been  unable  to  provide  a  full  proof.  The  main 
challenge  is  to  establish  a  lower  bound  on  the  average  run  length  to  false 
alarm  in  parallel  to  those  in  (5.5).  Another  difficulty  is  how  to  choose  the 
hard-thresholding  fe^’s  in  more  complicated  models. 


Appendix:  Proof  of  Theorems  5.1  and  5.2.  This  section  is  devoted 


to  prove  our  main  theorems,  Theorems  5.1  and  5.2. 


Proof  of  Theorem  5.1.  Intuitively,  only  those  affected  data  streams  pro¬ 
vide  information  to  detect  the  occurring  events,  and  the  quickest  possible 
way  to  detect  the  occurring  event  is  when  the  event  affects  the  data  streams 
instantaneously.  More  rigorously,  if  we  define 

{0,  if  5h  is  finite 

, 

oo,  if  (5fc  =  oo 

then  for  any  given  scheme  T{'y), 

E5i,...,5k(^(7))  >  infE5«,...,5^(T), 

where  the  infumum  is  taken  over  all  possible  scheme  r  satisfying  the  false 
alarm  constraint  (2.1).  An  alternative  and  possible  better  viewpoint  is  based 
on  a  time-shifting  argument  in  which  one  imagines  that  at  time  n  one  ob¬ 
serves  the  observations  Xk^n-Sf,  (instead  of  Xk^n)  when  5k  is  finite,  and  then 
applies  T{'y)  to  the  new  aligned  observations. 

Without  loss  of  generality,  assume  that  m  out  of  K  data  streams  are 
affected  by  the  event,  and  5*  =  0  for  1  <  i  <  m,  and  =  oo  for  m-|- 1  <i<K. 
That  is,  the  first  m  data  streams  are  affected  abruptly  and  simultaneously 
by  the  event  at  unknown  time  v,  and  other  data  streams  are  unaffected. 
Then  it  is  well-known  [16]  that  the  corresponding  optimal  scheme  is  the 
CUSUM  procedure 

'^cusum(^)  ~  ^  >  a} 


where  the  corresponding  CUSUM  statistic  is  given  by 

1U(-)  =  max  (luS  +  £  log  o)  • 


k=l 


By  (5.4),  J{5i,  ...,5k)  =  J{51,  ...,5*j^)  =  YaLi  HSiJi)  and  thus  the  right- 
hand  side  of  (5.3)  is  the  detection  delay  of  the  optimal  scheme 


that  provides  a  lower  bound  on  the  detection  delays  of  all  other  schemes. 
Therefore,  relation  (5.3)  holds  and  this  completes  the  proof  of  Theorem  5.1. 


Proof  of  Theorem  5.2.  Let  us  first  focus  part  (i)  on  the  properties  of  the 
hard  thresholding  scheme  Nhard{<i,b)  in  (3.4). 

To  prove  (5.5)  in  part  (i),  note  that  each  of  the  proposed  thresholding 
schemes  Na^b  dominates  the  “SUM”  scheme  Tsum(d  =  a)  in  (2.5).  That  is, 
for  any  6  >  0,  Na,b  >  Usum(a)  and  'E^'^\Na,b)  >  ^^°°HTsnm{a)) .  By  Theorem 
1  of  Mei  [14],  the  “SUM”  scheme  Tsiim(a)  satisfies  relation  (5.5),  and  so  are 
the  proposed  thresholding  schemes  f,  for  all  6  >  0. 

Now  let  us  prove  relation  (5.6)  in  part  (ii),  and  we  first  focus  on  the  hard 
thresholding  scheme  Nhardio^b).  It  is  clear  that  the  worst-case  detection 
delay  of  Nhardi^ib)  occurs  at  the  change-point  =  1,  and  thus  it  suffices 
to  show  that  E^^^~^\^{Nhard{o,,b))  satisfies  (5.6).  Without  loss  of  generality, 
we  assume  that  only  the  first  m  data  steams  are  affected  and  no  other 
data  streams  are  affected.  To  simply  our  notation  below,  denote  (5max  = 
maxi<j<m  Si.  Since  Nhard{a,  b)  is  increasing  as  a  function  of  6  >  0,  it  suffices 
to  show  that 


(6.2)  B^^-X{Nhard{a,  b'))  <  +  0{VV)  +  <5„,ax, 

as  b'  and  a  —  b'  go  to  oo. 

To  prove  (6.2),  the  essential  idea  is  to  compare  Nhard{o-,b')  with  a  new 
stopping  time  that  starts  to  monitor  the  change  at  time  (5max  and  is  in 
the  form  of  the  one-sided  sequential  probability  ratio  test  (SPRT).  Define  a 
stopping  time 


r(a,  b')  =  first  n  such  that  log  >  a  and 

T  log  ^  Pkb'  for  all  1  <  <  m, 


i=l 


fk{Xk,i 


(6.3) 


where  pk  is  defined  in  (3.2),  and  let  fs{a^  b')  be  the  new  stopping  time  that 
applies  T{a,b')  to  the  new  observations  after  time  5max-  Clearly,  whenever 
fs{a,  b')  stops,  our  proposed  scheme  Nhard{a,  b')  also  stops  (possibly  earlier). 
Thus 


=  5n,ax-l  +  Eir5^(r(a,6')), 

where  5^  is  defined  in  (6.1).  To  simplify  the  notation,  denote  by  the 
expectation  when  the  change  occurs  at  time  u  =  \  and  the  event  affects  the 
first  m  data  streams  immediately  but  does  not  affect  the  other  remaining 
K  —  m  data  streams.  So  it  suffices  to  show  that  the  stopping  time  r(a,  b') 
in  (6.3)  satisfies 


(6.4) 


E(i)(r(a,6'))  < 


Ek=iIigkJk) 

To  prove  this,  for  1  <  A:  <  m,  let 

Mk+n 


+  o{Vv). 


Pkb'Y 


TkiMk)  =  sup|n>l;  ^  log^^|^^<o| 


i — -|-1 

M  =  max  (Mk  +  Tk{Mk)  +  l) 
l<k<m  V  '  "  / 


M+n  m 


t(M)  =  i:  (Elogfim 


>  a  — 


m 

iJ2pkW}- 


k=l 


Note  that  in  the  definition  of  t{M),  the  threshold  a  —  {J2T=i  Pk)b'  goes  to 
oo  as  a  ^  oo,  since  ^2^=1  Pk  <  J2i^=i  Pk  =  ^  and  a  —  b'  is  assumed  to  go  to 
oo.  Combining  these  definitions  with  those  of  r(a,  b')  in  (6.3)  yields  that 


T(a,  6')  <  M  +  t{M)=  max  (Mk  +  Tk{Mk)  +  l)+t{M) 

l<k<m  V  / 


<  V  rfc(Mfc)  +  1 +  t(M)  +  max  M*.. 

l<k<m 

k=l 

Hence,  to  prove  (6.4),  it  suffices  to  establish  the  following  three  relations: 

(6.5)  E(i)(rA,(Mfc)) 

(6.6)  E(i)(f(M)) 


=  0(1)  for  all  1  <  A:  <  m; 

a  b' 


< 


(6.7)  E(^)f  max  Mk) 

V  l<k<m  / 


< 


ET=iH9kJk)  EtiHgkJk) 

b' 


+  0(1); 


Y.k=il{9k,fk) 


+  0{VU). 


Relation  (6.5)  is  well-known  in  renewal  theory,  e.g..  Theorem  D  in  Kiefer 
and  Sacks  [7],  since  log  {gk{^) / fk{X))  has  positive  mean  and  hnite  variance 
under  E^^^  by  Assumption  (A2).  Relation  (6.6)  also  follows  at  once  from  the 
standard  renewal  theory 


b' 


+  0{l), 


E™=1 1{9k,  fk)  Ef=i  H9k,  fk) 
as  a  — b'  goes  to  oo,  where  the  second  equality  follows  from  the  dehnition  of 
Pk  in  (3.2). 

The  proof  of  relation  (6.7)  is  a  little  more  complicated,  but  it  can  be  done 
along  the  same  line  as  that  in  Mei  [13].  Specihcally,  by  renewal  theory  and 
Assumption  (A2),  under 

b' 


E(')(Mfc)  =  -^^  +  0(l)  =  ^ 

I{9k,fk)  Y.k=iH9k,fk) 


+  0(1) 


and  Var*^^^(Mfc)  =  0{b'),  as  6  — )•  oo,  see  Siegmund  [26,  p.  171].  Hence, 

b'  b' 


E 


(  max  Mfc) 

V  l<k<m  / 


„  -|- E*-^^  max  (Mk - 

EtiiigkJk)  Ek=iH9kJk) 

L/  ^  Uf 


<  - +  yE« 

Ek=iH9kJk)  ^ 


k=l 


Y.k=iH9k,fk) 


< 


b' 


Ek=iI{9kJk) 


+  oiVU), 


where  the  last  inequality  follows  from  the  fact  that 


(E«|Mfc 


b' 


Ek=ili9k,fk) 


\y  <  E«(Mfc 


Ek=iH9k,fk) 

Var(i)(Mfc)  +  (E(i)Mfc 


6' 


J2k=iI{9kJk) 


=  o{b'). 


Therefore,  relations  (6.5)-(6.7)  are  proved,  and  hence  relation  (5.6)  holds  for 
the  hard-thresholding  scheme  Nhard{a-,b)  in  (3.4). 

Now  let  us  provide  a  sketch  of  the  proof  for  part  (ii)  of  Theorem  5.2  on 
the  top-r  scheme  Ntop,r{o)  in  (3.1)  and  the  combined  thresholding  scheme 
^comb,r{a,b)  in  (3.5).  Since  Ntop,r(fl)  is  a  special  case  of  Ncomb,r{<i,b)  with 
6  =  0,  it  suffices  to  prove  the  theorem  for  Ncomb,r{a-,b)  in  (3.5)  with  b  > 
0.  The  properties  of  false  alarms  are  straightforward  since  the  centralized 
“SUM”  scheme  Tsum(d)  again  provides  the  lower  bound  and  thus  relation 
(5.5)  also  holds  for  Ncomb,r{(i,  b)  with  6  >  0. 

It  remains  to  show  that  when  the  occurring  event  affects  at  most  r 
data  streams,  i.e.,  when  Y^k=i^{^k  <  oo}  <  r,  relation  (5.6)  holds  for 
^comb,r{cL,b)  with  6  >  0.  Without  loss  of  generality,  assume  that  the  af¬ 
fected  data  streams  are  just  the  first  m  data  streams  with  m  <  r.  Recall 
that  Uk^n  =  >  Pk^}  >  0,  and  we  order  the  non-negative  Uk^n^ 

as  >  ■■■>  U{K),n,  and  Ncomb,r{a,b)  stops  if  J2k=iU(k),n  >  a.  When 

m  <  r,  we  have 

r  r  m 

^  (  U{k),n  —  ^  (  Uk^n  ^  ^  (  Uk^n- 
k=l  k=l  k=l 

Thus,  if  at  some  time  no  we  have  Wk^no  ^  Pkb  and  Wk,no  ^  a  for 

1  <  k  <  m  (i.e.,  for  the  first  m  data  streams),  then  Ncomb,r{a-,b)  will  also 


stop  at  time  no  and  possibly  earlier.  Hence,  whenever  m  <  r,  the  stopping 
time  r(a,  b')  in  (6.3)  also  provides  an  upper  bound  on  the  detection  delay  of 
^comb,r{a,  b).  Thus  the  proposed  combined  thresholding  scheme  Ncomb,ri(i,  b) 
in  (3.5)  satisfies  relation  (5.6)  whenever  the  occurring  event  affects  at  most 
r  data  streams.  This  completes  the  proof  of  Theorem  5.2.  □ 
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Inequality  and  Its  Applications  to  Quantization 

Effects  on  Detection 
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Abstract 

It  is  well  known  that  quantization  cannot  increase  the  Kullback-Leibler  divergence  which  can  be  thought  of  as 
the  first  moment  of  the  log-likelihood  ratios.  In  this  paper,  this  result  is  extended  to  the  case  of  second  and  higher 
moments  of  the  log-likelihood  ratios.  It  is  shown  that  quantization  may  result  in  an  increase  of  the  second  or  higher 
moments  of  the  log-likelihood  ratio,  but  such  an  increase  is  at  most  by  a  universal  constant  that  only  depends  on  the 
value  of  the  moment.  Such  a  constant  is  2/e  for  the  second  moment.  The  result  is  further  applied  to  decentralized 
quickest  detection  problems  to  provide  a  simpler  sufficient  condition  for  the  asymptotic  optimality  theory. 

I.  Introduction 

Let  X  be  a  random  variable  taking  value  in  some  probability  space  In  some  applications,  the  X 

itself  is  unobservable  and  what  is  actually  observed  is  another  variable  Y  that  is  a  quantization  of  X,  or  more 
generally,  a  function  of  X,  say  Y  =  (j){X).  Thus  one  has  to  utilize  the  Y  to  develop  statistical  procedures  to 
make  decisions.  In  order  to  guarantee  the  success,  one  may  need  to  verify  that  the  distribution  of  Y  satisfies  some 
necessary  requirements.  However,  it  can  be  analytical  challenging  or  untractable  to  verify  the  assumptions  for  Y 
directly,  even  if  the  distributions  of  the  unobservable  X  are  known  to  belong  to  some  simple  family  of  distributions. 
For  instance,  this  may  happen  when  one  only  has  very  limited  knowledge  about  the  function  (j),  e.g.,  (p  belongs 
to  some  infinite-dimensional  functional  space.  To  overcome  such  a  difficulty,  it  is  natural  to  investigate  whether 
certain  properties  of  the  X  will  make  sure  that  the  requirements  of  Y  are  satisfied,  see  Le  Cam  and  Yang  [5]  for 
more  detailed  discussions  and  some  concrete  examples. 

Our  research  is  motivated  by  the  decentralized  sequential  detection  where  one  wants  to  find  an  appropriate 
quantization  function  (p  so  as  to  make  the  best  possible  decision  subject  to  the  constraint  that  the  observed  variables 
Y’s  belong  to  a  finite  alphabet.  This  requires  us  to  investigate  the  properties  of  the  moments  of  log-likelihood  ratio 
statistic  under  quantization  or  mapping.  Suppose  that  Hq  :  P  =  Pq  and  iJi  :  P  =  Pi  are  two  simple  hypotheses 
regarding  the  distribution  P  of  the  X,  and  the  X  has  a  density  fi{x)  under  (or  Pi)  with  respect  to  some  common 
underlying  probability  measure.  In  information  theory  and  statistics,  the  log-likelihood  ratio  of  X,  or  the  logarithm 
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of  Radon-Nikodym  derivative  of  Pi  with  respect  to  Pq,  is  defined  by 

and  the  widely  used  Kullback-Leibler  divergence  of  the  X,  denoted  by  I{fi,fo),  is  just  the  first  moment  of  Z 
under  Pi,  i.e.,  /(/i,  /o)  =  Ei{Z}.  Likewise,  for  the  Y  =  (j){X),  denote  by  Pf  and  fi{y;  (j>)  its  probability  measure 
and  its  probability  mass  or  density  function  fi{y;  </>)  under  Hi.  Then  the  log-likelihood  ratio  of  F  =  (p{X)  is  given 

and  the  Kullback-Leibler  divergence  of  the  Y  is  /o)  =  Ei  {Z^}  .  An  important  property  is  that  the  Kullback- 

Leibler  divergence  cannot  increase  under  a  mapping,  that  is. 


iMufo)<Hfi,fo)  (1) 

with  equality  if  and  only  if  F  =  (j>{X)  is  a  sufficient  statistics  of  X,  see  Theorem  4.1  of  Kullback  and  Leibler  [3]. 
This  is  consistent  with  our  intuition  that  F  =  is  generally  less  informative  than  the  X  itself.  Note  that  the 

inequality  (1),  which  will  be  referred  as  Kullback-Leibler’ s  inequality  below,  deals  with  the  first  moment  (or  mean) 
of  the  log-likelihood  ratios. 

The  main  goal  of  the  present  paper  is  to  extend  Kullback-Leibler’s  inequality  (1)  to  deal  with  the  second  or  other 
higher  order  moments  of  log-likelihood  ratios.  Section  II  extends  the  Kullback-Leibler’s  inequality  (1)  to  second 
moments,  and  Section  III  further  extends  it  to  other  general  higher-order  moments.  In  Section  IV,  our  results 
are  applied  to  decentralized  sequential  detection  to  provide  much  simplified  sufficient  conditions  for  asymptotic 
optimality  theories.  Section  V  gives  concluding  remarks. 


11.  Second-order  Moments 

For  the  X  and  F,  define  their  respective  second  moments  of  log-likelihood  ratios  as 

and 

Our  main  result  is  as  follows. 

Theorem  1.  For  any  measurable  function  (j>,  we  have 

VMiJo)<V{fi,fo)  +  -.  (2) 

e 

Proof:  Let  L  =  =  fi{X)/fo{X)  and  =  fi{Y;(j))/ fo{Y](j))  be  the  likelihood  ratios.  Recall  that 

in  the  proof  of  the  Kullback-Leibler’s  inequality  (1),  a  key  idea  is  to  use  the  fact  that  the  function  H{t)  =  —  \ogt  is 
convex,  so  that  we  can  apply  Jensen’s  inequality  to  Z^p  =  logL^  =  However  this  approach  fails  for  the 
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second  moment  case  since  the  function  i?2(f)  =  (~  logf)^  =  (logf)^  is  no  longer  convex  (or  concave).  Fortunately, 
this  idea  can  be  salvaged  by  finding  a  convex  function  that  dominates  H2{t).  Specifically,  note  that  the  function 
H2{t)  =  (logf)^  is  convex  on  f  <  e  but  is  concave  on  f  >  e,  and  thus  we  can  consider  a  new  function 

{(logf)^,  if  0  <  f  <  e 

(3) 

—  1  if  f  >  e 

Then  it  is  easy  to  see  that  H2{t)  is  a  continuous  convex  function  of  t  when  f  >  0.  Moreover,  H2it)  >  ^2(f)  = 
(logf)^.  In  fact,  H2(t)  equals  to  H2{t)  when  t  <  e  and  becomes  linear  when  t  >  e,  see  Fig.  1. 

To  prove  our  theorem,  let  denote  the  conditional  expectation  with  respect  to  a  given  value  of  the  observed 

data  Y  =  £  0,  then 


Since  Jl2(t) 


(logt)^ 


fo(X) 


hiX) 


Y 


/o(l^;</>)  _  1 


H2{t  for  f  >  0,  by  the  definition  of  fo),  we  have 


V^iJo)  =  E,{{\ogiL^)f}  =  E,{H2{L^^)} 


^  ^1 


=  e^{h2{e,{l-^\y})^ 

<  Ei{ei{h2{l-^)\y}} 

=  Ei{h2{L-^)}. 

Now  by  the  definition  of  H2(t)  on  equation  (3), 

E^[h2{L-^)]  =  E^[H2{L-^)l{L-^<e)]+E^[H2{L)l{L-^>e]] 

=  >e}| 

<  E^{H2{L-^))  +  -^E^{L-^)-Pi{L-^  >e} 

<  V{fufo)  +  - 

e 

where  the  last  inequality  follows  from  the  facts  that  l/(/i, /o)  =  ii^i{iT2(.^~^)}, =  J (Jq^x)/ fi{x))fi{x)dx 
1  and  Pi{L~^  >  e)  >  0.  Combining  the  above  inequalities  yields  (2),  completing  the  proof  of  the  theorem.  ■ 


It  is  useful  to  provide  some  comments  to  better  understand  our  theorems.  First,  the  discrete  version  of  the 
Kullback-Leibler’s  inequality  (1)  is  the  well-known  log-sum  inequality:  for  nonnegative  numbers  ai,...,a„  and 
bi, ...  ,bn,  denote  the  sum  of  all  a^’s  by  a  and  the  sum  of  all  b^s  by  b,  and  then  we  have 


log^  <^a,log^ 


i=l 


with  equality  if  and  only  if  Ui/bi  are  constant.  Meanwhile,  the  discrete  version  of  our  main  result  (2)  becomes  that 

,2  JE  /  n,\2  2 


a(iog5)  +-b, 


2  =  1 
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Fig.  1:  Dominating  Function  H2{t) 


where  the  extra  term  on  the  right  side  is  2b je  instead  oi^je  since  we  do  not  put  any  normalization  conditions  on 
a  or  b. 

Second,  a  comparison  of  (1)  and  (2)  shows  that  we  have  an  extra  constant  term  2/e  for  the  second  moment 
case,  and  thus  it  is  natural  to  ask  whether  or  not  the  term  can  be  eliminated,  i.e.,  whether  it  is  always  true  that 
fo)  ^  ^(/i)  /o)-  The  following  counterexample  provides  a  negative  answer.  Suppose  that  the  X  takes  three 
distinct  values  0,  1,  2  with  probabilities  29/36,  1/9,  1/12  under  Pi  and  equal  probabilities  1/3  under  Pq.  Let 
<j>  he  a  function  with  a  binary  range  {0, 1}  such  that  (/(O)  =  0,  (/(I)  =  (j){2)  =  1.  Then  it  is  easy  to  verify 
that  V{fi,fo)  =  0.9215  <  V/,(/i,/o)  =  0.9224.  More  generally,  other  counterexamples  can  be  easily  found  by 
choosing  two  distributions  Pi  and  Pq  of  X,  both  of  which  are  supported  on  n  +  1  (n  >  2)  points  xq,  ...  ,Xn  such 
that  the  likelihood  ratio  Lq  =  fo{xo)/fi{xo)  <  e  and  Li  =  fo{xi)/fi{xi)  >  e  for  i  =  1, . . . , n  with  Li, . . .  ,Ln 
being  n  distinct  values.  Then  if  we  consider  a  function  (p  that  maps  all  xi, . . . ,  Xn  to  a  single  point  yi  but  maps 
Xq  to  another  point  yQ,  then  l^(/i,/o)  <  T/,(/i,/o).  To  see  this,  note  that  H2{t)  =  (logt)^  is  strictly  concave  on 
f  >  e,  so 


fojxi) 


.  n  t  i...  iiLT  foixi) 

<  -  fl{xo))H2  [  y  - - r—, — r 

\f^l  1  -  fl[Xo) 
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and 


VifiJo) 


<  /l(xo)i?2  +/i(2/i;0)H2 

=  V^ihJo). 


(  fo{vi'A)\ 
\h{vi'A)) 


In  other  words,  unlike  the  case  of  Kullback-Leibler’s  inequality,  a  map  indeed  can  inflate  the  second  moment  of 
the  log-likelihood  ratio.  Fortunately,  our  theorem  shows  that  such  an  inflation  is  at  most  2/e. 


III.  General  Higher-order  Moments 

The  technique  we  developed  in  proving  Theorem  1  can  be  useful  to  deal  with  higher-order  moments  of  the 
log-likelihood  ratios.  To  be  specific,  for  a  positive  integer  j  =  1, 2, . . . ,  define 

W,  (A ,fo)  =  B,{  (zy  }  =  £;i  I  (^log  Illy ) '  I  (4) 

and 

=  E,  {{Z^y)  =  Si  I  (^iog||^y  I .  (5) 

It  turns  out  that  we  need  to  consider  two  different  cases,  depending  on  whether  j  is  even  or  odd.  For  the  purpose 
of  our  theorem,  let  us  define  two  sequence  of  constants.  For  any  integer  J  >  1,  define 

^  _  jXj  - 

and  when  j  is  odd,  further  define  C*  to  be  the  only  real  number  x  >  0  that  satisfies  the  equation 

X  =  {j  —  1)-^“^  —  Cj  exp{—x^Ey  (5) 

By  convention  we  set  0°  =  1,  and  thus  Ci  =  1  and  Ci  =  0. 

The  following  theorem  involves  higher-order  moments  of  the  log-likelihood  ratios,  and  includes  the  Kullback- 
Leibler’s  inequality  (1)  and  relation  (2)  for  second-order  moment  as  special  cases. 

Theorem  2.  For  any  measurable  function  f  and  any  integer  j  >  1,  we  have 

W^,,{h,fo)<WjihJo)  +  B,  (7) 

where  the  constant  B  =  Cj  if  j  is  even  and  B  =  C*  if  j  is  odd.  Moreover,  lF0j(/i,/o)  and  Wj{fi,fo)  have  a 
lower  bound  0  when  j  is  even,  and  have  a  lower  bound  —j{j  —  /e^~^  —  (j  —  1)^  when  j  is  odd. 

We  will  prove  Theorem  2  in  two  separate  cases,  depending  on  whether  j  is  even  or  odd.  Let  us  begin  with  the 
case  when  j  is  even,  and  we  will  prove  a  more  general  result  on  the  a-moments  of  the  absolute  values  of  the 
log-likelihood  ratios  Z  and  Zj,  for  any  real  number  a  >  1.  Specifically,  define 

VFA/i,/o)  =  i^i{|^r}  =  i^i|  log^jlj  I 
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and 

log^(r)  |. 

Lemma  1.  For  any  a  >  1, 

W^AfiJo)  <  WMiJo)  +  Co,  (8) 

where  the  constant  Ca  =  —  >  0  and  Ci  =  1  by  convention  that  0*^  =  1. 

Proof:  While  the  function  Ha{t)  =  \  logf|“  is  not  convex,  it  can  be  shown  that  it  is  dominated  by  the  following 
convex  function 

Ha{t),  if  0  <  t  <  ta, 

HA)  =  < 

I  Cat  —  da  if  t>  ta, 

where  ta  =  da  =  {a  —  1)““^  >  0,  and  Ca  =  ^  The  remaining  proof  is  identical  to  those  of 

Theorem  1  and  thus  omitted.  ■ 

As  in  Theorem  1,  it  is  generally  not  true  that  W0,q(/i,  /o)  <  Wa{fi,  fo),  and  the  counterexamples  can  be  easily 
found  by  exploring  the  fact  that  for  any  a  >  1,  the  function  Ha{t)  is  always  strictly  concave  when  f  >  In  other 
words,  the  counterexamples  can  be  constructed  by  picking  n  + 1  (n  >  2)  points  xq,  ■  ■  ■  ,Xn  and  two  distributions  Pi 
and  Pq  such  that  Lq  =  fo{xo) / fi{xo)  <  ta  while  Li  =  fo{xi)/fi{xi)  >  ta  n  distinct  values  for  7  =  1, . . . ,  n, 
and  then  proceeding  as  in  the  case  of  a  =  2. 

It  is  also  interesting  to  compare  the  Kullback-Leibler’s  inequality  (1)  with  the  case  a  =  1  in  Lemma  1;  we 
have  Ei{ZA  <  Ei{Z}  and  Ei\Z^\  <  Ei\Z\  +  1.  In  other  words,  while  the  first  moment  of  the  log-likelihood 
ratio  always  decrease  after  a  mapping,  the  first  moment  of  its  absolute  value  can  indeed  increase  although  such  an 
increase  is  at  most  1.  This  is  because  the  function  —  logf  is  convex  on  f  >  0  but  the  function  |  logf|  is  not  convex. 

Now  let  us  prove  Theorem  2  when  j  >  1  is  odd.  Fix  the  odd  integer  J  >  1,  and  the  key  is  to  find  a  convex 
function  that  dominates  H{t)  =  {—logt)f  By  taking  derivatives,  it  is  easy  to  see  that  H{t)  =  (— logf)^  is 
convex  on0<f<  lorf>  but  is  concave  when  1  <  f  <  et~^.  Thus,  if  we  let  to  =  then 

H{t)  <  H{to)  +  H'{to){t  —  to)  =  —Cjt  +  dj  when  1  <  f  <  fo,  where  Cj  =  —  >  0  and  dj  =  >  Q. 

A  simple  calculation  shows  that  the  line  y  =  —Cjt  +  dj  intersects  the  curve  y  =  H{t)  at  two  points:  one  of  them 
is  f  =  fo  =  >  1  and  the  other  one  is  in  the  interval  (0, 1]  and  denoted  by  f*  <  1.  Therefore,  we  can  construct 

a  function  H{t)  which  dominates  H{t): 

/ 

H{t)  =  {-\ogty  if0<f<f*(<l) 

H(t)  =  <  —Cjt  +  dj  if  f*  <  f  <  fo  =  • 

H{t)  =  {-logA  iff>fo(>l) 

v 

More  importantly,  the  function  H{t)  is  a  convex  function  on  t  >  0. 
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Recall  the  definition  of  C*  in  (6)  and  we  claim  that  H{t)  —  H{t)  <  C*.  To  prove  this  claim,  first  note  that 
C*  =  H{t*)  >  0  and  it  suffices  to  prove  the  claim  when  t*  <  t  <  to  (and  thus  H{t)  is  a  linear  function  with 
negative  slope).  The  proof  needs  to  consider  two  scenarios,  depending  on  whether  t  <  1  or>  1.  Iff*  <  t  <  1, 
then  H{t)  <  H(t*)  =  C*  while  H{t)  >  0,  and  thus  the  claim  holds.  Meanwhile,  if  1  <  f  <  fo,  then  by  taking 
derivatives,  it  is  easy  to  show  that  H{t)  —  H  (f)  is  a  decreasing  function,  and  thus 

H{t)  -  H{t)  <  H{1)  -  H{1)  =  H{1)  <  H{t*)  =  C*. 

Therefore,  our  claim  holds  and  0  <  H{t)  —  H{t)  <  C*  for  any  f  >  0. 

Relation  (7)  for  an  odd  integer  j  >  1  can  then  be  easily  proved  along  the  same  line  as  in  Theorem  1 .  It  remains  to 
show  that  for  an  odd  integer  j  >  1,  Wj{fi,  fo)  in  (4)  and  fo)  in  (5)  are  bounded  below,  since  the  random 

variables  or  may  take  positive  and  negative  values.  For  any  random  variable  X,  let  X+  =  max{X,  0}  be 
the  positive  part  of  X  and  let  X-  =  —  min{X,  0}  be  the  negative  part  of  X.  Then  X  =  X^  —  X-,  and  it  is 
evident  that  X  >  —X_.  The  following  lemma  completes  the  proof  of  Theorem  2. 

Lemma  2.  When  j  >  1  is  an  odd  integer, 

El  +  (j  -  1)^ 

where  0^  =  1  by  convention. 


Proof:  Fix  the  odd  integer  j  >  1,  consider  the  function 

■if{t)  =  —  min{0,  (—  logf)^}  =  max{0,  (logf)-^}. 


(9) 


By  taking  derivatives,  it  is  easy  to  see  that  as  a  non-decreasing  function,  ipif)  is  concave  on  f  >  fg,  where  fg  = 
Thus 

V'(fo)  if  f  <  fo 


V'(f)  < 


V'(fo) +■*/'' (fo)(f-fo)  if  f  >  fo 


or  equivalently. 


■0(f)  <  (j  -  {f  <  fo}  +  {Cjt  -  dj)I  {t  >  fg} 

where  Cj  =  —  >  0  and  dj  =  {j  —  >  0.  Recall  that  L  =  =  fi{X)/fo{X)  is  the  likelihood  ratio, 

and  thus 

Ei[{Z^)_]  =  Ei{f,{L-y) 

<  (j  -  {L-^  <  fo}  +  C,Ei  {L-^I{L  >  fg}}  -  d,Pi  {L-i  >  fg} 

<  [j-iy  +C,Ei{L-^] 

=  + 

completing  the  proof  of  the  lemma.  ■ 
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Fig.  2:  A  Decentralized  Sensor  Network 


IV.  Decentralized  Sequential  Detection 

The  problem  that  motivated  us  to  write  the  present  paper  arises  from  decentralized  sequential  detection,  see, 
Veeravalli,  Basar  and  Poor  [18],  and  Veeravalli  [14],  [16].  Fig.  2  depicts  a  typical  configuration  of  a  decentralized 
network  system  that  consists  of  K  geographically  deployed  local  sensors  , . . . ,  and  a  fusion  center.  Each 
local  sensor  observes  a  raw  data  AT*  at  time  step  n  =  1,2,...,  whereas  the  fusion  center  makes  a  final  decision 
when  stopping  taking  observations.  Due  to  constraints  on  communication  bandwidths  or  requirements  of  reliability, 
the  local  sensors  are  required  to  compress  the  raw  data  to  quantized  sensor  messages  C/^’s,  which  all  belong  to 
finite  alphabets,  say  {0, 1, —  1}  respectively,  and  then  send  the  quantized  messages  to  the  fusion  center.  In 
other  words,  the  fusion  center  has  no  direct  access  to  the  raw  observations  and  has  to  make  its  decisions  based  on 
the  quantized  sensor  messages. 

There  are  many  possible  topologies  for  the  decentralized  network  system,  and  one  widely  used  scenario  is  the 
system  with  limited  local  memory  and  full  feedback,  or  Case  E  in  Veeravalli,  Basar  and  Poor  [18].  Mathematically, 
in  this  scenario,  at  time  n  the  quantized  sensor  message  from  the  local  sensor  to  the  fusion  center  is 

(10) 

where  Fn-i  =  (here  C/j*  =  {U^, . . . ,  C/*}  )  denotes  all  past  sensor  messages  at  the  fusion  center. 

In  the  simplest  version  of  decentralized  quickest  change  detection  problems,  we  assume  that  an  event  occurs 
to  the  network  system  at  some  unknown  time  v,  and  changes  the  probability  measure  of  the  raw  data  from 
one  given  probability  measure  Pq  (with  density  /g )  to  another  given  probability  measure  Pi  (with  density  /f). 
Furthermore,  we  assume  that  the  observations  are  independent  over  time  and  from  sensor  to  sensor.  The  objective 
is  how  to  jointly  optimize  the  policies  at  the  local  sensors  and  fusion  center  levels  so  as  to  detect  the  change  as 
soon  as  possible  subject  to  a  constraint  on  the  false  alarm  rate. 

A  crucial  challenge  in  decentralized  quickest  change  detection  is  which  kind  of  local  quantizers  should  be  used 
at  each  local  sensor.  On  the  one  hand,  this  is  easy  if  one  further  assumes  that  each  local  sensor  uses  a  stationary 
local  quantizer,  as  the  corresponding  problem  reduces  to  the  classical  centralized  case  and  various  well-developed 
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optimal  or  asymptotic  optimal  theories  are  applicable,  see  for  example  Lorden  [6],  Moustakides  [9],  Page  [10], 
Poliak  [11],  Shiryayev  [12]  and  [13],  etc.  In  fact,  it  is  not  difficult  to  see  that  the  optimal  stationary  quantizer  (jf 
for  any  local  sensor  is  the  one  that  maximizes  the  local  Kullback-Leibler  divergence  /q  ),  and  it  can  be 

shown  that  such  an  optimal  quantizer  (f>*  is  a  Monotone  Likelihood  Ratio  Quantizer  (MLRQ),  see,  for  example, 
Tsitsiklis  [15],  Crow  and  Schwartz  [1],  Tartakovsky  and  Veeravalli  [14]. 

On  the  other  hand,  the  scenario  becomes  more  complicated  if  the  local  quantizers  are  allowed  to  be  non-stationary. 
By  comparing  with  Bayes  procedures,  Veeravalli  [17]  conjectures  that  the  schemes  based  on  the  optimal  stationary 
MLRQ  (j)*  are  asymptotically  optimal  regardless  whether  the  quantizers  are  stationary  or  not.  While  this  conjecture 
sounds  reasonable  as  maximizing  the  Kullback-Leibler  divergence  seems  to  be  natural  to  construct  optimal  local 
quantizers,  it  is  very  challenging  to  prove  or  disprove  it,  partly  because  of  the  regularity  conditions  of  the  quantized 
observations.  For  example,  a  sequence  of  non-stationary  quantizers  may  outperform  that  of  stationary  quantizers 
when  the  second  order  moments  of  the  log-likelihood  ratios  of  non-stationary  quantizers  can  go  to  infinity. 

Some  sufficient  conditions  under  which  this  conjecture  holds  are  available  in  the  literature.  By  Lai  [4],  this 
conjecture  is  true  under  the  following  sufficient  conditions: 

(  i^+t  K  I  ^ 


lim  sup  ess  sup  <  maxy^  >  Itot{^  +  S)n 

i—>-oo  I  t<in  ‘  ^  ‘  ^ 


i—u  k—1 


c/i,...,c/,_i  }  =0. 


(11) 


where  P^''^  is  the  probability  measure  when  the  change  occurs  at  time  v,  is  the  likelihood  ratio  for  the  quantized 
data  C/y  i.e.. 


and  hot  =  Ef=l  ^max  with  =  sup^ /(/y /y  (/)).  Here  /y(u;(/)y  is  probability  mass  function,  i.e.. 


ZU  =  log 


fSiut-Af) 


ft{u-  yy  =  pt  = «} ,  m  =  o,  i. 


Unfortunately,  condition  (11)  involves  all  possible  non-stationary  quantizers,  and  it  is  impossible  to  verify  it  directly. 
By  using  Kolmogorov’s  inequality  for  martingales,  Mei  [7]  provides  a  stronger  sufficient  condition,  and  shows  that 
the  conjecture  holds  if  there  is  a  uniform  bound  on  the  second  moments  of  the  log-likelihood  ratios  of  quantized 
observations.  Specifically,  Mei  [7]  showed  that  condition  (11)  holds  if  for  all  fc  =  1, . . . ,  7L, 

sup  iy,(/y  f^)  <  oo.  (12) 

Moreover,  condition  (12)  holds  when  the  quantized  messages  belong  to  binary  sensor  messages  with  I  =  2  and 
when  /o  and  /i  belong  to  the  same  one-parameter  exponential  family  satisfying  certain  restrictions,  see  Theorem 
2  of  [7].  However,  it  is  still  an  open  problem  whether  condition  (12)  holds  in  general  or  not,  as  the  quantizers  can 
have  arbitrary  forms  and  belong  to  the  infinite  dimensional  functional  space. 

Our  main  theorem  allows  us  to  tackle  more  general  scenarios.  Specifically,  by  Theorem  1,  if  for  all  fc  =  1, . . . ,  Tf, 

V{ft  fo  )  =  J  (log  fHx)dx  <  oo,  (13) 
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then  condition  (12)  holds  and  so  does  (11).  Note  that  condition  (13)  only  deals  with  the  densities  /*  of  raw 
observations  and  does  not  involve  the  stationary  or  non-stationary  quantizers.  Moreover,  it  is  a  standard  assumption 
in  the  statistical  literature  as  a  regularity  condition  for  the  raw  density  functions.  Therefore,  condition  (13)  provides 
a  simple  and  reasonable  sufficient  condition  under  which  the  long-standing  conjecture  of  asymptotic  optimality  of 
the  schemes  with  the  optimal  stationary  MLRQ  (jf  is  true  regardless  whether  the  quantizers  are  stationary  or  not. 

Similarly,  our  results  can  also  be  applied  to  the  problem  of  decentralized  sequential  hypotheses  testing,  see 
Veeravalli,  Basar  and  Poor  [18].  This  problem  is  similar  to  the  above-mentioned  decentralized  quickest  change 
detection  problem  except  that  the  distributions  of  the  raw  data  do  not  change  over  time.  In  other  words,  the  raw 
observations  {X^'\  form  i.i.d.  sequences  over  time  n  and  are  independent  among  different  sensors.  We  still  have 
two  simple  hypotheses  iJo  and  Hi  regarding  the  distributions  of  2f^’s,  but  the  objective  is  to  use  as  few  samples 
as  possible  to  correctly  decide  which  of  these  two  simple  hypotheses  is  true.  An  optimal  sequential  test  is  one  that 
balances  the  tradeoff  between  the  average  sample  size  under  each  hypothesis  and  the  probabilities  of  making  Type 
I  and  II  errors,  see  Veeravalli,  Basar  and  Poor  [18],  Veeravalli  [16]  and  Mei  [8]  for  more  details. 

Unlike  the  quickest  change  detection  problem,  non-stationary  quantizers  are  generally  necessary  in  order  to 
develop  asymptotically  optimal  decentralized  sequential  tests.  This  is  because  each  local  sensor  will  have  two 
kinds  of  optimal  stationary  quantizers:  one  maximizes  /^(/o  ,/f)  and  the  other  maximizes  I<p{fi ,  fo),  due  to  the 
asymmetric  properties  of  the  Kullback-Leibler  divergences.  Denote  them  by  <j>Q  and  4)\  respectively.  To  develop 
a  simple  but  asymptotically  optimal  decentralized  sequential  tests,  Mei  [8]  introduces  the  concept  of  “tandem 
quantizers”  where  the  test  procedure  is  divided  into  two  stages  (also  see  Section  IV  of  Kiefer  and  Sacks  [2]  for  a 
closely  related  experimental  design  problem).  In  the  first  stage,  any  reasonable  stationary  quantizer  is  used  and  the 
network  system  makes  a  preliminary  decision  about  which  of  two  hypothesis  is  likely  to  be  true.  Then  at  the  second 
stage,  each  local  sensor  switches  to  one  of  two  optimal  stationary  quantizers  (/>§  or  (j>\,  based  on  the  preliminary 
decision. 

It  was  shown  in  Mei  [8]  that  under  the  condition  (12)  together  with  the  symmetric  condition  with  /o  and 
/i  exchanged,  the  decentralized  sequential  tests  with  the  tandem  quantizers  are  asymptotic  optimal  among  all 
decentralized  sequential  tests  with  or  without  stationary  quantizers.  By  our  Theorem  1,  sufficient  conditions  for  the 
asymptotic  optimality  of  tests  with  tandem  quantizers  can  be  reduced  to  (13)  and  the  symmetric  condition  with  /o 
and  /i  exchanged. 


V.  Conclusion 

In  this  paper,  we  extend  the  Kullback-Leibler’s  inequality  (1)  to  deal  with  the  quantization  or  mapping  effects 
on  second  or  other  higher  moments  of  the  log-likelihood  ratios.  Our  main  results.  Theorems  1  and  2,  show  that  a 
quantization  can  increase  these  quantities  by  at  most  a  universal  constant  that  does  not  depend  on  the  distributions 
of  the  raw  observations  or  the  forms  of  the  quantizers.  The  results  are  then  used  to  provide  a  simple  but  useful 
sufficient  condition  for  asymptotic  optimality  theories  in  decentralized  sequential  detection. 
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1  Introduction 

In  this  paper  we  are  interested  in  developing  formulas  for  several  performance  indexes  concerning 
the  two  sided  CUSUM  (2-CUSUM)  test  applied  to  the  case  of  the  Brownian  Motion  (BM)  process 
with  constant  drift.  BM  models  in  connection  with  2-CUSUM  test  will  become  our  main  model 
for  the  problem  of  of  behavioral  data  modeling  treated  in  detail  later.  What  is  really  interesting  in 
our  approach  is  that  we  will  come  up  with  closed  form  expressions  for  all  our  performance  indexes 
which  then  will  be  used  to  describe  real  data  in  behavior  research. 

2  The  CUSUM  test  performance 

Let  {^}t>o  be  a  BM  with  constant  drift  satisfying  the  following  stochastic  differential  equation  (sde) 

d^t  =  fJ-dt  +  dwt  (1) 

where  {wt}t>o  is  a  standard  Wiener  process  and  fj,  is  the  constant  drift  of  the  BM.  Let  us  recall 
the  definition  and  some  basic  results  concerning  the  CUSUM  test 

Ui  =  -  yt x;  x>0 
rrit  =  minjO,  inf  Ug} 

o<s<t  (2) 

yt  =  ut-  mt 
S  =  inf{t  >  0  :  yt  >  i/}. 

Process  {yt}t>o  is  known  as  the  CUSUM  process  which,  we  observe,  it  is  started  from  a  nonnegative 
point  x;  S  is  the  CUSUM  stopping  time  and  u  the  corresponding  threshold.  Please  note  that  the 
CUSUM  test  defined  in  (2)  with  x  =  0  is  optimum  when  detecting  a  change  in  the  drift  of  the  BM 
from  0  to  A  [see  Shiryaev  (1961),  Beibel  (1961)  and  Moustakides  (2004)].  Here,  however,  we  will 
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allow  the  CUSUM  parameter  A  to  differ  from  the  actual  mean  /x  of  the  BM  and,  furthermore,  we 
are  going  to  allow  the  test  to  start  from  any  nonnegative  point  x  instead  of  the  classical  case  x  =  0. 
This  will  give  to  our  model  more  flexibility  and,  also,  allow  us  to  infer  about  the  optimality  of  the 
human-decision  mechanism,  if  we  can  verify  that  the  most  appropriate  model  is  A  =  ^. 

Using  the  formulas  developed  in  the  literature,  we  have 

^{x,  m)  =  IEq5|yo  =  x]  =  ^ ^  (3) 

LC  _  #<"-1  ^^2^ 

'ijj{x,  p)  =  |'e“^‘^|yo  =  xl  = -  (4) 

where 

/5  =  1  -  2^;  Ki  =  ]^(^±  +  ;  (5) 

and  E^[-|yo  =  x]  denotes  expectation  with  respect  to  the  measure  induced  by  the  BM  defined  in 
(1)  for  a  CUSUM  process  that  starts  from  x  >  0. 

Let  us  now  recall  the  basic  renewal  property  which  we  used  in  order  to  obtain  the  previous 
formulas 

Remark  1;  In  the  CUSUM  process  ytii  =  1,2  the  running  minimum  mt  whenever  it  changes 
it  follows  ut-  This  suggests  that  whenever  m*  changes,  we  necessarily  have 

yt  =  ut-mt  =  0, 

or  that  mt  cannot  change  outside  the  set  {yt  =  0}. 

This  property  is  very  important  for  the  classical  CUSUM  and  it  will  turn  out  to  be  equally  crucial 
for  deriving  formulas  for  2-CUSUM. 

3  The  2-CUSUM  test 

The  2-CUSUM  test  consists  of  two  one-sided  CUSUM  tests  running  in  parallel,  where  one  test 
detects  positive  changes  and  the  other  negative.  In  this  work  we  are  going  to  consider  only  the 
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following  symmetric  case 


ul  =  — jt  +  X^t  +  xi;  xi  >  0 
ml  =  min{0,  inf  u]} 

0<s<t 

vl  =  uj  -  ml 

5i  =  inf{f  >  0  :  >  u}. 


ul  = - jt-  X^t  +  X2]  X2>0 

mf  =  min{0,  inf  u^} 

0<s<t  (g) 

2  2  2 
Vt  =Ut-  rUt 

S2  =  inf{f  >0:yl>i'}. 


where  Si,  i  =  1,  2  are  the  two  one-sided  CUSUM  branches  of  the  2-CUSUM  test.  As  we  can  see,  the 
first  component  detects  positive  changes  whereas  the  second  negative.  Of  course  the  corresponding 
2-CUSUM  stopping  time  is  the  minimum  of  the  two,  that  is. 


5  =  5iA52.  (7) 

The  same  test  can  also  be  used  to  make  a  selection  between  the  positive  and  the  negative  case 
simply  by  observing  which  stopping  time  stops  first.  If  our  decision  function  is  d  G  {1,2}  where 
“1”  denotes  decision  in  favor  of  the  positive  change  and  “2”  in  favor  of  the  negative,  then  we  can 
define  d  as  follows 

{1  when  5i  <  S2 

(8) 

2  when  S2  <  5i 

This  definition  will  be  used  in  order  to  find  several  interesting  performance  indexes  associated  with 
the  2-CUSUM-based  decision  mechanism. 


3.1  Important  2-CUSUM  renewal  property 

Let  us  now  introduce  a  very  important  property  enjoyed  by  2-CUSUM  which  was  first  observed  by 
Siegmund  (1985). 

Theorem  1.  Let  yl,yl  he  the  two  CUSUM  statistics  of  a  symmetric  2-CUSUM  procedure  with 
initial  points  xi,X2  respectively.  Then  whenever  one  of  the  two  statistics  reaches  a  level  which  is 
larger  than  xi  +  X2,  for  the  first  time,  the  other  initializes  to  0. 

Proof:  Consider  the  sum  process 

U  =  vl  +  vl  =  -  ml  -  mf. 
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Note  that  Yq  =  xi  +  X2  and  let  7  >  Yq  be  a  level  which  is  larger  than  the  initial  value  of  the  sum 
process.  If  one  of  the  two  processes  reaches  the  level  7  for  the  first  time,  then  we  necessarily  have 
Yt  >  j.  In  fact  we  will  show  that  exact  equality  holds.  Let  to  be  the  time  that  Yt  reaches  for  the 
first  time  the  value  7.  Note  that  this  can  happen  only  during  an  increase  of  the  process  Yt.  During 
this  increase  if  none  of  the  two  processes  m\ ,  ml  changes  we  can  see  that  Yt  will  be  decreasing  with 
time,  contradiction.  Hence  at  least  one  of  the  two  processes  ml,  ml  changes  at  time  to-  Due  to 
Remark  1,  if  the  process  m\  changes  this  necessarily  implies  that  yl  =  0.  Note  that  at  to  we  cannot 
have  both  processes  changing  at  the  same  time  since  this  would  imply  that  =  yl^  =  or  Yt^  =  0 
which  is  again  contradiction  since  Yt^  =  7  >  0.  Therefore  we  conclude  that  exactly  one  process 
m-l  must  change  at  to.  Suppose,  for  example,  that  this  process  is  then  at  time  to,  =  0  and 
Vto  ~  7’  which  also  suggests  that  to  is  the  time  where  one  of  the  process  hits  the  level  7  for  the 
first  time  and  the  other  necessarily  initializes  to  0. 

Based  on  Theorem  1  we  have  the  following  straightforward  remark: 

Remark  2:  Let  the  symmetric  2-CUSUM  procedure  be  defined  as  in  (6)  with  initializing  points 
xi,X2.  If  the  threshold  satisfies  v  >  xi  +  X2,  then  when  one  of  the  branches  stops  first,  the  other 
branch  has  CUSUM  statistics  which  is  re-initialized  to  0. 

Remark  2  is  crucial  in  obtaining  formulas  for  the  2-CUSUM  test  by  making  use  of  the  corresponding 
formulas  for  the  one-sided  branches. 

3.2  2-CUSUM  performance 

In  this  subsection,  our  goal  is  to  find  the  performance  of  the  2-CUSUM  stopping  time  with  respect 
to  different  probability  distributions.  As  before  we  will  assume  that  the  BM  process  {^t}  has  rate 
/i  and  that  the  two  branches  start  from  different  nonnegative  values  xi,X2.  The  formulas  we  are 
going  to  develop,  as  a  result  of  Theorem  1,  will  be  valid  for  the  case  xi  -|-  X2  <  Unfortunately 
no  formula  is  known  when  xi  -|-  X2  >  This  poses  no  particular  problem  since  we  are  going  to 
consider,  mostly,  the  case  xi  =  X2  =  0. 

Again  we  borrow  ideas  from  Siegmund  (1985)  developed  for  the  discrete-time  case  and  transfer 
them  to  the  continuous-time  continuous-path  BM  case.  The  following  theorem  presents  a  collection 
of  interesting  formulas  regarding  the  symmetric  2-CUSUM. 

Theorem  2.  LetS  be  the  symmetric  2-CUSUM  with  branches  Si,  S2,  parameters  ±\,  initial  points 
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xi,X2  and  common  threshold  v,  as  in  (7).  Let  he  the  probability  measure  and  the  corre¬ 
sponding  expectation  induced  by  the  BM  process  with  rate  /x.  Define  the  functions  = 

Kf^[Si\yQ  =  Xi]  and  fifixi^fj)  =  =  Xi]  referring  to  the  one-sided  branches  according  to 

(3)  and  (f).  If  xi  -\-  X2  <  n  then  we  have  the  following  formulas  for  2-CUSUM 


IE^[‘5|2/o  =  =  X2] 


P^[5i  <  S2\yo  =  xi,yo  =  X2] 
IE''[e"^‘^|2/o  =  xi,yi  =  X2] 
IE^[e"*‘^^l{5i<52}l2/o  =  Xu  yo=  X2] 


Mxi,ia)4>2{0,ti)  +  h)Mx2,  h)  -  (/>i(0,/x)(/>2(0,^) 

4>i{0,  p)  +  (j)2{0,  p) 

(j)2{x2,p)  +  4’iiO,p)  -  (f)i{xi,p) 

</>i(0,/x)  +  </>2(0,/i) 

filjxu  pM2{0,  p)  -  1]  +  fi2{x2,  pMljO,  p)  -  1] 
V’i(o,^)V’2(o,/x)  - 1 

p)fi2{x2,  p)  -  ^pl{Xl,p) 

V’i(o,m)V’2(o,^)  - 1 


(9) 

(10) 

(11) 

(12) 


Proof.  The  proof  borrows  ideas  from  Siegmund  (1985)  developed  for  the  discrete-time  case. 
We  easily  verify  that 

5i  =  5  +  [5i  -  52]1{5^>52}- 

Applying  expectation  on  both  sides  of  the  previous  equality  we  can  write 

Mxi,h)  =  =  xi,yl  =  X2]  +IE^  [lE^[5i  -52|J'52]l{5i>52}|yd  =  xi,yo  =  2:^2] 

=  =  xi,yo  =  X2]  +  [^1  >  52|yd  =  xi,yQ  =  X2]  •  (13) 


Note  that  going  from  the  first  equation  to  the  second  we  used  the  fact  that  since  we  consider  the 
event  ^2  <  5i,  we  have  that  at  time  S2  the  first  CUSUM  branch  restarts,  therefore  IE^[5i— ^2!  Jxsj]  = 
(/>i(0,  p).  The  same  way  we  can  show  that 


Mx2,p)  =  IE^[5|2/d  =  xuVo  =  X2]  +  <(>2(0, /x)P^  [52  >  5i|2/o  =  xi,yi  =  X2]  .  (14) 


Solving  (13)  for  [5i  >  S2\yo  =  xi,  |/q  =  X2]  and  (14)  for  P^  [^2  >  5i|yQ  =  xi,  x/q  =  X2] ,  we  obtain 

P>-  [5l  >  =  x,. =  X,]  ^ 

P>‘  [5.  >  g.  lyj  =  X. .  ,§  =  X  j  ^  '■>  -  ^  ■  (16) 


Adding  the  two  equations  and  solving  for  E^[5|?/q  =  xi,yQ  =  X2]  yields  (9).  Replacing  (9)  in  (15) 
we  obtain  (10). 
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For  proving  the  other  two  relations  we  proceed  similarly.  In  particular  we  observe  that 

Taking  expectation  of  the  previous  equality  we  obtain 
V^i(xi,;u)  =  =  xi,yl  =  X2]  +  [V'i(0,/u)  -  1]IE^  [e"''‘^"l{5i>52}l2/o  =  xi,yi  =  X2]  ,  (17) 

and  in  the  same  way 

V'2(x2,/u)  =  =  xi,y‘^  =  X2]  +  [V'2(0,/u)  -  1]IE^  l{52>5i}l2/o  =  xi,yi  =  X2]  •  (18) 

Solving  for  the  last  term  in  both  equation  and  adding  the  two  equalities  we  obtain 

1  _  2  _  .  =  XU  yi  =  X2] 

E  [e  I2/0  -  XI,  yo  -  X2J  -  M0,y)-1 

i’2{xi,y)  -  IE^[e~^‘^|j/()  =  xi,yl  =  X2] 

V’2(0,^)-l 

Solving  for  the  desired  quantity  yields  (11).  Substituting  (11)  in  (18)  yields  (12). 

Theorem  2  will  become  the  basis  for  finding  formulas  for  even  more  exotic  quantities.  Such 
possibilities  could  be  the  pdf  of  the  2-CUSUM  stopping  time  or  the  conditional  pdf  of  5i  given 
that  5i  <  82-  Of  course  this  amounts  to  Laplace-inverting  the  two  functions  in  (11)  and  (12),  a 
task  which  is  not  very  straightforward,  as  we  will  find  out  next. 

4  Performance  computation 

As  we  can  see  from  Theorem  2,  performance  of  2-CUSUM  depends  on  the  drift  value  //  of  the  BM. 
We  are  going  to  consider  two  special  case.  In  the  first  we  will  compute  the  performance  of  the  test 
when  y  =  {)  and  in  the  second  y,  =  X,  i.e.  the  drift  equal  to  the  CUSUM  parameter  parameter. 
Let  us  consider  each  case  separately,  since  the  corresponding  results  tend  to  be  quite  different  in 
nature. 

4.1  Case  of  zero  drift 

The  first  step  is  to  compute  the  performance  of  the  two  branches.  Note  that  the  two  branches  differ 
in  their  CUSUM  parameter  since  the  first  uses  -|-A  and  the  second  —A.  Considering  the  case  y  =  0, 
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we  have  from  (5)  that  for  both  branches 


P  =  ^, 


This  immediately  translates  into  the  following  expressions  for  the  two  branches 


(/)i(x,0)  =  (t)2{x,{)) 
V'i(x,0)  =  'ip2{x,0) 


^  {e"  ,T)) 

K2e'^^^  — 


If  we  now  assume  that  the  two  branches  start  from  x  =  0  then 


We  can  now  compute  the  performance  of  the  2-CUSUM  test  for  xi  =  X2  =  0.  Since  i;i)i(x,0)  = 
4>2{x,0)  and  ipi{x,0)  =  ^2{x,0),  we  have  the  simplified  expressions 

>  52]  =  2 

IgOr  -s5i  ^  2  V^l(O^O)  ^  2  ~  =  2  ^ 

■01(0,  0)  +  1  /t2e'^i'^  —  +  K2  —  Ki  K2e^^’^  — 

K2  —  Kl 

E“[e-**l(*<&>l  =  ^  E»[e-‘*|Si  <  Sj]  =  E»(e-*^1. 

One  of  the  most  challenging  problems  in  this  analysis  is  finding  the  pdf  of  the  stopping  time 
S,  that  is  P°[5  G  dt]  or  the  conditional  pdf  P°[5i  G  <  52].  Obtaining  the  pdf  q^{t)  of  the 

2-CUSUM  stopping  time,  amounts  to  Laplace-inverting  the  moment  generating  function  IE®[e“^‘^]. 


(19) 

(20) 
(21) 

(22) 


4.1.1  Inverse  Fourier  transform  computation  of  the  pdf 

As  we  said,  in  order  to  find  the  probability  density  q^{t)  of  5,  we  need  to  Laplace-invert  the  function 
]g0|-g-s5]  respect  to  the  variable  s.  Clearly  q^{t)  is  a  causal  function  of  time  t,  that  is,  the 
support  of  q^{t)  is  [0, -|-oo).  Furthermore  q^{t)  is  absolutely  integrable  since  by  being  a  pdf  we 
have  l(7°(t)l  =  q^{t)  and  J^q^{t)dt  =  1.  It  is  known  that  causal  functions  that  are  absolutely 
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integrable  have  Laplace  transforms  that  converge  for  s  in  the  complex  right  half  plane  Re{s}  >  0. 
Consequently,  if  we  call 

1*00 

Q°(s)  =IE°[e-*‘^]  =  /  e-^^q^it)dt, 

Jo 

then  Q®(s)  has  all  its  poles  in  the  left  half  plane  (excluding  the  origin).  This  means  that  we  can 
apply  the  Laplace  inversion  formula  and  integrate  along  the  path  Re{s}  =  0  as  follows 

1  rO+joo  1  roo 

=  Q\s)e^^ds=—  (23) 

Jo-jco  J-oo 

with  the  last  integral  interpreted  as  an  inverse  Fourier  transform  and  obtained  by  replacing  s  with 
jfi.  The  inverse  Fourier  transform  formula  can  lend  itself  to  a  numerical  computation  of  the  pdf 


Figure  1:  Typical  form  of  the  characteristic  function  \Q^{jf2)\  of  the  2-CUSUM  stopping  time. 


Figure  2:  Typical  form  of  the  2-CUSUM  stopping  time  pdf  function  g^(0). 
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function  q^{t).  We  can  see  in  Fig.  1  the  typical  form  of  the  amplitude  of  the  function  for 

the  case  A  =  1  and  v  =  2.  The  pdf  q^{t),  it  is  known  that,  when  fi  =  0,  has  exponential  tails. 
Indeed  we  can  see  in  Fig.  2  this  fact  for  a  threshold  v  =  2.  Tail  behavior  is  usually  an  important 
issue  for  random  variables.  Therefore  in  the  next  part  we  will  follow  an  alternative  path  to  describe 
the  pdf  q^{t),  in  which  tail  behavior  will  be  more  apparent. 


4.1.2  Series  expansion  of  the  pdf 


Another  approach  for  computing  the  pdf  q^  (t)  consists  in  expanding  the  moment  generating  function 
Q^(s)  as  a  series 

oo  . 

where  Ak  =  lim  (s  -  sa:)Q°('S)- 

k=l 

Sequence  {s^}  is  the  collection  of  poles  of  the  function  Q^(s)  and  {Alfc},  as  we  can  see,  the  corre¬ 
sponding  sequence  of  residues  of  the  poles.  Please  note  that,  for  simplicity,  we  have  assumed  that 
the  poles  have  single  multiplicity  which  is  true  when  the  previous  limit  is  hnite.  Such  an  expansion, 
when  Laplace  inverted,  yields 

OO 

=  (24) 

k=l 

Equ.  (24)  can  help  in  identifying  the  exponential  tail  of  q^{t)  since  this  will  simply  correspond^  to 
the  term  assuming  that  we  have  ranked  the  poles  in  decreasing  order  regarding  their  real 

parts,  that  is,  Re{si}  >  Re{s2}  >  •  •  •  >  and  recalling  that  these  real  parts  are  negative. 

According  to  (24)  we  hrst  need  to  find  the  poles  of  the  function  Q®(s)  defined  in  (21).  Unfor¬ 
tunately,  as  we  will  see  shortly,  this  can  be  achieved  only  numerically.  However,  asymptotically  as 
1/  ^  oo,  we  will  be  able  to  describe  this  exponential  decay  in  a  more  accurate  analytical  way. 
Define  the  following  function  of  x 


V{x)  = 


(1  -h  -  (1  - 


2x 


=  e 


0.5u 


{cosh(0.5z^x)  —  X  ^  sinh(0.5z^x)}  (25) 


then 

=  /  r-V, - ■ 

v(y4Tfl)  +  i 

From  the  previous  relation  we  conclude  that  if  {xk}  is  the  collection  of  roots  of  the  equation 
V(x)  -|-  1  =  0  then  the  poles  of  Q°(s)  are  simply  Sk  =  (x|  —  l)A^/8,  A:  =  1,  2, . . ..  We  can  also  find 
^Of  course  we  need  to  prove  absolute  summability  of  the  remaining  terms. 
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for  these  poles  the  corresponding  residues  in  terms  of  Xk-  Note  that  if  we  call  x  =  \J\  +  8s/A^  then 
s  =  (x^  —  l)A^/8  and  clearly  s  ^  means  that  x  x^.  Therefore  we  can  write 

.2  _  ^2  X2  ^  ;^2 


Xk 


A  =  -  .,)Q0{s)  =  h  _lhn  =  y  ’ 


(26) 


where  V'{x)  denotes  derivative.  In  the  last  equation  we  made  use  of  Hospital’s  rule  since  V(xfc)  +  1  = 
0  and  we  assumed  that  V'{xk)  ^  0.  From  this  and  (24)  we  obtain  the  following  series  expansion  for 
9°(«) 

“ 


q 


2  ^  V'(«) 


(27) 


To  complete  the  determination  of  the  pdf  q^{t),  we  need  to  find  the  roots  {xk}  of  the  equation 
V(x)  +  1  =  0.  We  have  the  following  conjecture  for  this  problem. 

Conjecture  1.  Regarding  the  poles  of  Q.^{s)  defined  in  (21),  we  have  the  following  claims: 
i)  All  poles  of  Q^(s)  are  real  and  negative. 


a)  When  u  >  2.5569  then  Q'^(s)  has  a  pole  in  the  interval  [— A^/8,0)  and  all  other  poles  (infinite 
number)  in  the  interval  (— oo,— A^/8). 

Hi)  When  0  <  n  <  2.5569,  then  Q'^(s)  has  all  its  poles  in  the  interval  (— oo,  —  A^/8]. 


We  note  that  the  function  V(x)  is  symmetric  in  x.  This  means  that  if  Xk  is  a  root  of  V(x)  +  1  =  0, 
so  is  —Xk-  Since  both  values  produce  the  same  pole  Sk,  we  are  going  to  limit  ourselves  to  x  >  0.  Let 
us  first  seek  the  real  solutions  of  the  equation  V(x)  +  1.  Since  for  z  >  0  we  have  cosh(2;)  >  sinh(2;) 
we  conclude  that  V(x)  +  1  >  0  when  x  >  1.  Therefore  if  a  real  root  exists,  it  has  to  lie  inside  the 
interval  [0,1].  We  note  that  V(l)  +  1  =  2  >  0  while  V(0)  +  1  =  (1  —  0.5iy)e^'^’^  +  1.  The  latter  value 
is  nonpositive  when  n  >  2.5569  assuring  existence  of  a  root  inside  [0,1].  We  need  to  show  that 
this  root  is  unique  when  v  is  less  than  2.5569  and  nonexistent  when  n  is  smaller  than  this  value. 
In  Fig. 3  we  see  the  two  forms  of  the  function  V(x)  +  1  for  the  two  possibilities  of  v.  This  root  xi 
corresponds  to  a  real  pole  si  in  the  interval  [— A^/8,  0). 

Let  us  now  consider  imaginary  solutions  of  V(x)  +  1  =  0.  By  replacing  x  =  ju,  the  equation  is 
equivalent  to 


A{uj)  =  e  +  1}  =  cos{0.5uuj)  —  w  ^  sin(0.5j^a;)  +  e  =  0.  (28) 
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Figure  3:  Form  of  the  function  V(x)  +  1  for  v  >  2.5569  (blue)  and  v  <  2.5569  (black). 

where  the  hyperbolic  functions  are  transformed  into  regular  trigonometric  functions.  Fig.  4  depicts 
a  typical  form  of  the  function  A{oj).  We  can  see  that  there  is  an  infinite  number  of  roots.  In  fact 
as  u  grows,  the  roots  approach  the  roots  of  the  equation  cos(0.5z/a;)  +  =  0  which  are  easy  to 

compute. 


Figure  4:  Typical  form  of  the  function  A{oj)  and  corresponding  sequence  of  roots  {cok}  (gray  circles). 

Of  course  the  actual  computation  of  the  roots  of  the  equation  A{uj)  =  0  is  important  for  applying 
the  series  expansion  formula  in  (27).  However,  if  we  are  interested  in  describing  the  tails  of  q^{t) 
we  need  to  specify  the  smallest  pole  si.  We  note  that  whenever  V(a:)  +  1  has  a  real  root,  which 
surely  happens  when  u  >  2.5569  then  this  generates  the  smallest  pole.  Indeed  if  we  call  this  root 
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xi  then  the  corresponding  pole  is  si  =  (xf  —  l)A^/8  and  since  xi  £  [0, 1)  we  have  — A^/8  <  si  <  0. 
The  other  imaginary  roots  of  the  form  x^  =  give  rise  to  poles  of  the  form 

A2  ;y2 

Sk  =  -(Wfc  +  1)—  <  <  Si  <  0, 

suggesting  that  si  is  the  leading  pole  that  defines  the  exponential  tail  behavior  of  the  pdf  q^{t).  Of 
course  when  0  <  u  <  2.5569  then,  as  we  said,  there  is  no  real  root  for  V(x)  +  1  and  therefore  the 
tail  behavior  is  governed  by  the  smallest,  in  amplitude,  imaginary  root  x^  =  jtOk- 

As  an  example  let  us  compute  the  roots  for  the  case  u  =  2  and  the  corresponding  pdf  q^{t) 
using  the  series  expansion  (27).  From  the  above  we  know  that  we  have  only  imaginary  roots  Xk- 
The  smallest  in  amplitude  is  xi  =  jT.1198.  The  next  roots  have  the  values  j4.1081,  j8.1050, 
jlO.5259,  JT4.4438,  jT6.8434,  . . . ,  while  the  poles  Sk  become  -0.2818,  -2.2345,  -8.3363,  -13.9743, 
-26.2030,  -35.5876,  ...with  corresponding  residues  Ak  equal  to  0.3604,  -0.8281,  1.6147,  -2.0894, 
2.8633,  -3.3365,  ....  If  we  compare  the  pdf  q^{t)  obtained  using  the  series  expansion  (27)  with  the 
numerical  integration  using  (23)  the  two  curves  are  indistinguishable,  therefore  the  graph  is  the 
same  as  in  (2).  What  is  more  interesting  however  is  to  plot  the  series  expansion  using  a  logarithmic 
scale  for  q^{t)  and  compare  it  to  the  leading  exponential  term  In  Fig,  5  we  can  see  the  two 

curves.  We  observe  that  after  some  point  (in  fact  very  rapidly),  the  leading  exponential  term  (black) 
prevails,  taking  over  the  tails  (blue)  completely.  This  phenomenon  becomes  more  pronounced  if 
we  consider  large  values  for  the  threshold  v.  It  is  exactly  this  case  we  would  like  to  consider  in  the 
sequel. 


Figure  5:  Comparison  of  pdf  q^{t)  (blue)  and  leading  exponential  term  (black). 
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4.1.3  Asymptotic  performance  for  large  thresholds 


In  this  part  we  would  like  to  analyze  the  asymptotic  performance  of  the  leading  exponential  term 
for  threshold  values  that  satisfy  v  ^  1.  We  must  point  out  that  our  reasoning  is  not  going  to  he 
mathematically  rigorous.  We  will  only  sketch  the  main  idea  of  the  proof.  Let  us  begin  by  considering 
the  real  root  xi  of  the  equation  V(x)  +  1.  For  large  values  of  i/  we  observe  that  V(l)  +  1  =  2  while 
V(0)  +  1  =  S>  2  which  suggests  that  xi  must  be  closer  to  1  than  to  0.  Assuming  that  xi 

is  a  (negative)  perturbation  of  1,  that  is,  xi  =  1  —  e  where  0  <  e  <C  1  and  linearizing  the  function 
V(x)  +  1  around  x  =  1,  we  find 

e  =  4e-"(l  +  o(l)). 


This  implies  that  the  first  pole  is  extremely  small  and  equal  to  si  =  —  +  o(l)).  If  we 

substitute  this  value  in  the  formula  (26),  we  obtain  Ai  =  A^e“^(l  +  o(l))  which  is  of  the  same 
order  as  si. 

For  the  remaining  imaginary  solutions  of  the  equation,  we  can  see  that  for  large  u  and  large  w, 
the  function  A{ijj)  in  (28)  behaves  like  cos(0.5j^ti;),  which  suggests  that  ojk  ~  (2A:  —  2>)0{l)/v,  k  = 
2,3, .. .  yielding  poles  Sk  =  — A^[l  +  {2k  —  3)^0(1)/j^^]/8  and  residues  Ak  =  A^e“°'^'^(— l)^(2/c  — 
3)0(l)/i^^,  k  =  2,3,  ■  ■  ■■  In  both  cases  the  term  0(1)  is  uniform  over  k  and  u. 

We  can  now  verify  the  correctness  of  these  values  by  computing  the  average  detection  delay. 
We  have  that 


OO  j  j  OO  A 

E^k  _  I  ^k 

k=l  1  k=2 


1 


e’'{l  +  o{l))  +  e-^-^^u 


OO 

'^E 

k=2 


(-l)"(2fc- 3)0(1)  I 

y  +  {2k-3fo{i)f  y 


This  suggests  that 


1 

-  w  ^ 
1 


0(1)  E 

■'"0(1) 


A:=0 

■  0. 


(2A:  +  1)3 


For  the  last  equality  we  used  the  fact  that  I/(2Ic  +  1)^  <  oo  and  the  limit  is  for  v  — )■  co. 

We  can  therefore  see  that  the  difference  between  the  average  time  and  the  average  of  the  leading 
exponential,  tends  to  zero  as  the  threshold  increases,  even  though  both  averages  tend  to  infinity. 
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We  therefore  conclude  that  IE*^[iS]  =  e'^(l  +  o(l))/A^  and  this  conclusion  is  corroborated  by  the 
exact  formula  for  IE^[5]  in  (19).  Since  we  have  specified  a  rate  of  convergence  of  Ai/ s\  towards  the 
mean  time,  i.e.  it  is  clear  that  we  can  use  higher  order  estimates  for  ^i,  si,  up  to  and 

including  0(1)  terms.  These  terms  can  be  estimated  accurately  since  they  are  larger,  in  order  of 
magnitude,  than  the  convergence  rate. 

Regarding  the  form  of  the  pdf  under  large  values  of  the  threshold  v,  by  comparing  the  poles, 
we  arrive  at  the  following  important  conclusion: 

Remark  3:  As  the  threshold  z/  — )•  oo,  we  note  a  clear  separation  between  the  leading  pole  that 
tends  to  0  and  all  other  poles  that  tend  to  — A^/8.  This  guarantees  an  exponential  tail  behavior 
even  in  the  limit,  as  z/  — )■  oo. 

It  is  interesting  to  prove  this  remark  formally.  Borrowing  ideas  from  Taylor  (1975),  we  will  show 
that,  as  z/  — )•  oo,  a  properly  normalized  version  of  the  stopping  time  S  tends  to  a  pure  exponential. 
Indeed  consider  the  following  version  of  the  stopping  time  S  =  e~'^S.  Then 

=  Q°(se-'^). 

Fix  s,  with  Re{s}  >  0,  and  consider  the  limit 


lim  =  lim  Q^{se-'') 

Note  that  we  can  write  \AAf~e  =  1  +  0.5e  +  o(e)  and  (1  +  e)“^/^  =  1  —  0.5e  +  o(e)  which,  if  used  in 
the  previous  limit  after  observing  that  |8se“^/A^|  <C  1,  we  can  write 


lim 

U^OO 


The  latter  result  suggests  that,  in  the  limit,  e~’''S  is  exponentially  distributed  with  mean  1/A^, 
which  of  course  corresponds  to  the  leading  pole  of  Q^(s).  Consequently,  for  large  values  of  the 
threshold,  the  leading  exponential  term  prevails  completely  over  the  other  terms.  As  we  are  going 
to  see  next,  this  is  not  at  all  the  case  when  the  drift  of  the  process  is  not  equal  to  0. 
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4.2  Case  of  drift  equal  to  the  CUSUM  parameter 


Let  us  now  consider  the  case  where  the  BM  drift  fi  is  equal  to  the  CUSUM  parameter  A.  Here  the 
first  branch  with  parameter  A  must  detect  the  change  whereas  the  other  branch  with  parameter 
—A  should  not.  In  other  words  5i  <  ^2  corresponds  to  a  correct  decision  while  the  opposite  to  a 
wrong  one.  Of  course  in  this  case  the  error  probability  is  expected  to  be  (much)  less  than  0.5. 

Let  us  again  compute  the  necessary  one-sided  CUSUM  expressions.  We  start  by  computing 
the  functions  (/>i(x,A),  (/>2(x,  A),  ^pl{x,X),  ^2{x,X)-  Note  that  for  ^  =  A  we  have  p  =  —1  and 
Kip  =  0.5(— 1  ±  i/l  +  Ss/A^)  which  yields 


(j)i{x,X)  =  ^  {u  -  X  +  e  '"-e  *}  (29) 

_  (-1  +  +  (1  + 

^  (-1  +  +  8s/A2)e-°-5(i+Vi+8*A")'^  +  (1  +  y/f  +  8s/A2)e°'5(-i+Vi+8*A")^  ’ 

Similarly  for  the  second  one-sided  CUSUM  branch,  by  considering  p  =  X  and  CUSUM  parameter 
—A  we  obtain  p  =  3,  Kip  =  0.5(3  ±  Y^ir+~8s7^)-  Thus 


(j)2{x,X) 

'ip2{x,X) 


(3  -F  v^9  -h  8s/A2)e°-5(3-V9+8s/A2)x  _  (3  _  +  8s/;v2)g0.5(3-rV9+8s/A2)x 

(3  -b  1/9  +  8s/A2)e°-®(3-V9+8s/A2)!.  _  (3  _  +  8s/A2)e°-^(3+V9+8«/^")'^ 


(31) 

(32) 


Applying  the  general  relations  we  have  developed  in  Theorem  2  and  assuming  that  both  initializing 
points  satisfy  xi  =  X2  =  0,  we  end  up  with  the  following  equations 


IE^[5] 
T^[5i  >52] 


<(>1(0,  A)(/>2(0,  A) 

<('i(0)  ^)  +  4’2{9,  A) 

</>i(0,  A) 

<('i(0)  A)  -b  4>2{9,  A) 

2V>i(0,  X)M0,  A)  -  V;i(0,  A)  -  V>2(0,  A) 
•01(0,  A)V’2(0,  A)  -  1 
0i(O,  A)[02(O,  A)  -  1] 

0i(O,  A)02(O,A)  -  1  ■ 


(33) 

(34) 

(35) 

(36) 


We  can  see  that  as  far  as  the  000,  A)  functions  are  concerned,  for  threshold  S>  1,  we  have 
that 


02(0,  A)  >  0i(O,  A). 


Indeed  we  observe  that  0i(O,  A)  increases  linearly  with  1/  whereas  02(0,  A)  exponentially.  This  has 
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the  following  consequence 


(37) 

(38) 


E"|S]  =  A{^- 1  +  0(1)) 

P"|5i  >  -  1  +  0(1)}  =  +0(1)). 

In  Figs.  6  and  7  we  plot  the  exact  value  of  the  two  quantities  as  a  function  of  v  for  the  case  A  =  1. 
We  observe  a  rapid  convergence  to  the  linear  and  exponential  part  respectively. 


4.2.1  Empirical  formula 

In  the  literature  the  empirical  speed-accuracy  tradeoff  formula  has  been  observed  and  reported,  see 
Wickelgren(1977)  and  Zhang  and  Chang  (2008): 

P(correct  decision)  =  1  —  exp(— a(t  —  b)),  {t  >  b  >  0,a  >  0).  (39) 

Equivalently,  under  our  notations,  the  probability  >  ^2]  of  deciding  erroneously  is  propor¬ 
tional  to  for  properly  selected  exponent  s.  In  other  words  it  has  been  experimentally 

verified  that  the  following  relation  is  true 

Prob  {Wrong  decision}  =  >  ^2]  =  C  x 

where  s  =  a  and  C  =  exp(a6)  in  view  of  relation  (39). 

As  we  are  going  to  see,  our  analysis  supports  a  slightly  different  expression 

Prob  {Wrong  decision}  =  >  ^2]  =  C  x  IE^[5]  x 


We  must  emphasize  that  the  extra  term  IE^[5]  =  0{v')  is  not  really  important  since,  from  an 
asymptotic  point  of  view,  its  contribution  to  the  exponential  term  is  negligible. 

Adopting  this  new  model,  let  us  attempt  to  specify  analytically  s  and  C  considering  that  we 
are  in  the  asymptotic  case  u  ^  1.  Note  that  IE^[e“^‘^]  is  given  in  (35).  Assuming  that  s  is  real  and 
positive,  this  immediately  suggests  the  following  approximations 


IE^[5] 
V'i(0,  A) 
^>2(0,  A) 


^1/(1  +  0(1)) 


2y/9  +  8s/X^ 

/9  +  8s/A2  -  1' 


79 


From  the  previous  formulas  we  realize  that  1  V’i(O)  ^  ^^2(0,  A)  which  if  used  in  (35)  yields 


IE^[e-^‘5]  =  {^^1(0,  A)  +  ^>2(0,  A)}(1  +  o(l))  =  V^i(0,  A)(l  +  o(l)).  (40) 


In  order  to  specify  the  two  parameters  C  and  s  we  need  to  equate  (40)  to  (38).  Equating  first 
the  exponential  parts  and  in  particular  the  two  exponents,  results  in  the  following  relation  for  s 

0.5(\/l  +  8s/A2  -  1)  =  3 


which  yields 


s  =  6  X  A^. 


The  proportionality  constant  C  we  are  looking  for,  can  now  be  computed  as  the  limit 


C  = 


lim 

U^OO 


>  ^2] 


2 

7 


A2. 


Let  us  summarize  our  analysis  in  the  next  remark. 

Remark  4:  Our  analysis  proposes  the  following  analytic  formula  between  the  decision  error 
probability  and  the  decision  delay  S: 


2 

Prob  {Wrong  decision}  =  -A^  x  IE^[5]  x  . 


4.2.2  Pdf  computation 

We  now  come  to  the  last  part  which  is  the  computation  of  the  pdf  q^{t)  =  G  dt].  Since  we 
have  available  the  moment  generating  functions  in  (35)  it  suffices,  as  in  the  case  n  =  0,  to  Laplace 
invert  it.  For  notational  simplicity  let  us  call  Q^(s)  =  and  invert  the  function  Q^(s)  using 

the  same  inverse  Fourier  transform  technique  we  applied  for  fi  =  0.  If  in  the  formula  (35)  we  replace 
s  =  jfi  then  the  form  of  the  Fourier  transform  Q^{jO)  of  the  desired  pdf  is  very  similar  to  the  case 
Q^(jl7)  depicted  in  Fig.  2.  Inverting  the  corresponding  function  numerically  yields  the  desired  pdf 

In  Fig.  8  we  present  the  result  of  the  numerical  inverse  transformation  for  the  case  X  =  1,  i'  =  5. 
The  pdf  q^{t)  is  depicted  in  blue  and  for  comparison  we  have  also  included  the  pdf  q^{t)  (black). 
We  can  see  that  the  first  pdf  is  concentrated  on  much  lower  values  of  t  while  the  second  has  much 
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thicker  tails.  Regarding  q^{t),  as  we  can  see  it  has  a  rather  interesting  form.  The  left  tail  seems 
extremely  weak  resembling  the  tails  of  a  Gaussian,  however  the  right  tail  is  clearly  exponential. 
This  rather  peculiar  performance  has  been  observed  experimentally  in  behavior  research  (reference 
????).  This  is  a  rather  convincing  indication  that  our  proposed  2-CUSUM  model  is  very  efficient 
regarding  the  application  of  interest.  As  far  as  the  exponential  tail  is  concerned,  let  us  again  try  to 
identify  it  more  accurately  by  using  the  series  expansion  method  we  employed  in  the  case  of  q^{t). 


4.2.3  Leading  exponential  component 


As  for  we  would  like  to  apply  a  similar  series  decomposition  on  Q^(s),  that  is, 

Q^(s)  =  y^ - where  5k  =  Urn  (s  -  (7k) Q^(s).  (41) 

,s  -  ak 
k=\ 

Here  {dk}  is  the  collection  of  poles  for  the  function  Q^(s)  and  {5k}  the  corresponding  collection  of 
residuals.  As  before  we  assume  that  all  the  poles  are  simple.  Inverting  the  previous  formula  yields 


Q^{t)  = 


It  is  clear  that  our  first  step  consists  in  finding  the  poles  {(Jk}-  This  task  turns  out  to  be  more 
involved  than  in  the  //  =  0  case,  basically  because  the  two  functions  ■01  (0,  A)  and  02(0,  A)  differ 
significantly  (while  for  ^  =  0  they  are  equal).  By  consulting  (35)  we  realize  that  the  poles  are  the 
roots  of  the  equation 

0i(O,A)02(O,A)-1  =  O, 


that  belong  to  the  negative  half  plane  (the  solution  s  =  0  is  excluded  since  for  this  value  the 
numerator  of  Q^{s)  is  also  0  and  one  can  verify  that  s  =  0  is  not  a  pole).  If  we  replace  the 
functions  0i(O,  A)  from  (30)  and  (32)  we  end  up  with  the  equation  A(s)  =  0  where 
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Using  Hospital’s  rule,  we  can  now  obtain  a  more  convenient  formula  for  Bk  in  terms  of  the  function 


Bh  =  -e 


_,2-V'i(0,A)-V’2(0,A) 


H'(s) 


S — (J;, 


For  the  roots  of  the  equation  ^(s)  =  0,  we  make  the  following  conjecture: 


(44) 


Conjecture  2.  The  equation  2l(s)  =  0  with  A{s)  defined  in  (43)  has  one  root  equal  to  0  (which  is 
not  a  pole)  and  all  other  roots  are  real,  negative  and  strictly  less  than  — A^/8. 

What  is  suggested  with  the  conjecture  is  that  there  is  no  pole  which  lies  in  the  interval  [— A^/8, 0) 
(as  in  the  case  /i  =  0)  and  that  all  poles  lie  in  the  interval  (— oo,  — A^/8).  As  long  as  the  poles  are 
well  separated  which,  as  we  are  going  to  see  in  the  next  part,  happens  when  the  threshold  takes 
upon  small  to  moderate  values,  the  right-tail  behavior  is  exponential  and  governed  by  the  leading 
pole.  Let  us  therefore  compare  q^{t)  with  the  leading  exponential  term  Bie^^^  and  verify  how  well 
the  latter  describes  the  right-tail  behavior  of  the  pdf.  In  Fig.  9  we  can  see  q^(t)  for  A  =  1  and  r  =  5 
at  the  logarithmic  scale.  The  leading  pole  can  be  found  numerically  and  has  a  value  cJi  =  —0.2384 
while  the  corresponding  residue  according  to  (44)  is  Bi  =  0.6613.  We  can  see  again  the  perfect 
match  between  the  leading  exponential  term  and  the  tail  of  q^{t). 


4.2.4  Asymptotic  performance 


Fig.  10  depicts  the  typical  form  of  the  function  A(s).  We  observe  that  as  r  increases  the  leading 
pole  approaches  — A^/8.  In  fact  we  have  the  following  interesting  asymptotic  formula  for  this  pole 


A2 


(Tl  =  1  —  -h 


2tt 


dvr 


From  the  same  figure  we  also  realize  that  all  poles  approach  the  same  limit  — A^/8  as  z/  — )■  oo.  This 
has  a  very  interesting  consequence.  Even  though,  for  every  finite  value  of  r,  the  right-tail  of  the  pdf 
has  an  exponential  profile,  because  all  the  poles  tend  to  accumulate  as  z/  — )■  oo,  the  tails  gradually 
lose  their  initial  form  and  converge  to  a  Gaussian  tail.  This  can  be  clearly  seen  in  Fig.  11  where  we 
present  the  pdf  of  the  normalized  version  S  =  (S  —  IE^[5])/yT'  of  the  2-CUSUM  stopping  time  S 
for  different  values  of  n.  This  phenomenon,  as  we  said,  occurs  because  all  poles  tend  to  accumulate 
at  — A^/8  suggesting  that,  in  the  limit,  the  expansion  proposed  in  (41)  is  no  longer  valid.  We  recall 
that  in  the  case  /z  =  0  a  similar  accumulation  takes  place  as  well,  but  the  leading  pole  remains  well 
separated  from  the  other  poles,  thus  imposing  an  exponential  behavior  for  q^{t),  even  at  the  limit. 
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Borrowing  again  ideas  from  Taylor  (1975),  we  can  demonstrate  that  the  limiting  distribution  of 
the  normalized  stopping  time  S  is  indeed  Gaussian.  This  can  be  seen  by  computing  the  moment 
generating  function  of  5  =  (5  —  IE^[5])/yT' =  {S —  2v'/X‘^)/^/u-\-o{l).  Replacing  s  with  s/yT",  then 
fixing  s,  we  can  show  that 

V^i(0,  A)  =  ^  ^_3^+0(i)_ 

If  we  consider  now  IE[e  ^  ]  and  use  (35),  we  can  show  that 

IE[e”^‘^]  =  V'i(0,  A)(l  +  o(l))  = 

suggesting  that 

This  clearly  means  that  S  tends,  in  distribution,  to  a  Gaussian  with  mean  0  and  variance  4/A^. 

The  limiting  Gaussian  behavior  has  of  course  its  practical  and  theoretical  significance,  but  we 
must  keep  in  mind  that,  for  small  and  moderate  values  of  i',  the  form  of  the  right-tail  is  clearly 
exponential.  Gaussian  behavior  is  manifested  only  for  large  values  of  n.  In  fact  the  exponential 
right-tail  behavior,  as  we  said,  is  very  important  for  the  application  of  interest  since  it  was  observed 
experimentally. 
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Decision  error  probability 


Figure  6:  Average  detection  delay  as  a  function  of  the  threshold 


Figure  7:  Decision  error  probability  as  a  function  of  the  threshold  u. 
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PDF  q\t)  pdf 


Figure  8:  Pdf  q^{t)  (blue)  and  q^{t)  (black)  for  A  =  1  and  u  =  5. 


Figure  9:  Pdf  q^{t)  and  corresponding  leading  exponential  component. 
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Figure  10:  Form  of  the  function  ^(s)  and  corresponding  poles  for  different  values  of  the  threshold 

V. 


Figure  11:  Pdf  of  the  normalized  2-CUSUM  stopping  time  S  =  (5  —  IE^[5])/v^,  for  different  values 
of  the  threshold  v. 
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